Abstract

We present a machine learning algorithm for building classifiers that are composed of a
small number of disjunctions of conjunctions (or’s of and’s). An example of a classifier of
this form is as follows: If X satisfies ($x_1$ = ‘blue’ AND $x_3$ = ‘middle’) OR ($x_1$ = ‘blue’
AND $x_2$ = ‘<15’) OR ($x_1$ = ‘yellow’), then we predict that Y=1, ELSE predict Y=0. An
attribute-value pair is called a literal and a conjunction of literals is called a pattern. Models
of this form have the advantage of being interpretable to human experts, since they produce a
set of conditions that concisely describe a specific class. We present two probabilistic models
for forming a pattern set, one with a Beta-Binomial prior, and the other with Poisson priors.
In both cases, there are prior parameters that the user can set to encourage the model to have
a desired size and shape, to conform with a domain-specific definition of interpretability. We
provide two scalable MAP inference approaches: a pattern level search, which involves
association rule mining, and a literal level search. We show stronger priors reduce computation.
We apply the Bayesian Or’s of And’s (BOA) model to predict user behavior with respect to
in-vehicle context-aware personalized recommender systems.

Keywords: statistical learning and data mining, association rules, interpretable classifier, Bayesian 
modeling 

Reproducibility: all code and datasets will be made publicly available on acceptance 
1 Introduction

Our goal is to construct a model that not only classifies data but also “explains” data. This
predictive classification model consists of a small number of disjunctions of conjunctions, that
is, the classifiers are or’s of and’s, which have both classification and expressive power. This
form of model has some recent precedent in [16,18,19,28] as a form of model that is natural
for modeling consumer behavior, and interpretable to human experts. In particular, it has been
hypothesized that people make purchasing decisions using simple or’s of and’s models (e.g.,
“I would only purchase product X if it has options 1 and 2 or options 3 and 4.”). The set of
conditions within the model should be sparse, as humans can handle about 7±2 cognitive entities
at once [32]. Beyond modeling human decision-making, or’s of and’s models can strike a nice
balance between accuracy and interpretability for general predictive modeling problems.

We consider the classification problem where we observe $\{\mathbf{x}_n, y_n\}$ pairs, where
$\mathbf{x}_n$ is a vector of real-valued, categorical, or binary attributes and $y_n \in \{0,1\}$.
A literal is an attribute-value pair (e.g., $x_1$=‘blue’), denoted as $r$. A pattern is a conjunction
of literals (e.g., $x_1$=‘blue’ AND $x_2$=‘<5’ AND $x_3$=‘middle’), denoted as $a$. It has the form
$a = r_1 \wedge r_2 \wedge \dots$, where $\wedge$ denotes the and operation. We call the number of
literals in $a$ the length of $a$. A pattern set is a disjunction of patterns, denoted as $A$. It has
the form $A = a_1 \vee a_2 \vee \dots$, where $\vee$ denotes the or operation.
We call the number of patterns in $A$ the size of $A$.
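The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type aliases, the function names `pattern_applies` and `classify`, and the attribute names `x1`, `x2`, `x3` are hypothetical and do not come from the paper's code; observations are assumed to be dictionaries mapping attribute names to categorical values.

```python
from typing import Dict, FrozenSet, List, Tuple

Literal = Tuple[str, str]      # (attribute name, value), e.g. ("x1", "blue")
Pattern = FrozenSet[Literal]   # conjunction (AND) of literals
PatternSet = List[Pattern]     # disjunction (OR) of patterns


def pattern_applies(x: Dict[str, str], a: Pattern) -> bool:
    """Return True if observation x satisfies every literal in pattern a."""
    return all(x.get(attr) == value for attr, value in a)


def classify(x: Dict[str, str], A: PatternSet) -> int:
    """Return 1 if at least one pattern in A applies to x, else 0."""
    return int(any(pattern_applies(x, a) for a in A))


# The three-pattern example from the abstract (attribute names are assumed).
A_example: PatternSet = [
    frozenset({("x1", "blue"), ("x3", "middle")}),
    frozenset({("x1", "blue"), ("x2", "<15")}),
    frozenset({("x1", "yellow")}),
]

size = len(A_example)                   # size of A: number of patterns
lengths = [len(a) for a in A_example]   # length of each pattern: number of literals
```

In this sketch a missing attribute simply fails the literal that tests it, so no imputation step is involved.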

We take a generative approach to the construction of or’s of and’s classifiers and introduce
two models, a model with beta-binomial priors, called BOA-BetaBinomial, and a model with
Poisson priors, called BOA-Poisson. Both models have priors that can be adjusted to suit a
domain-specific notion of interpretability, as it is well-known that interpretability comes in
different forms for different domains [3,15,21,29,30,39]. In particular, the prior parameters of
the BOA-Poisson are the expected pattern lengths and pattern set size. The parameters of
BOA-BetaBinomial are pseudo-counts for the number of conjunctions of each size to be selected within
the model. That is, if the user desires conjunctions that have two conditions each, the
pseudo-counts for size two conditions can be increased. Users can set these parameters to obtain a model
of an approximate desired size, though it is possible for the prior to be overwhelmed by the data.

We provide two inference methods. The first technique uses a combination of association rule
mining and simulated annealing to approximate the globally optimal BOA maximum a posteriori
(MAP) model. This approach is motivated by a theoretical bound that allows us to reduce the
size of the computationally hard problem of finding the MAP solution so that it may be more
manageable to solve in practice. This bound states that we need only mine patterns that are
sufficiently frequent in the database. The second method uses a literal-based stochastic local
search where a neighboring point is proposed by changing a literal in the current pattern set.
Each method has its own advantages and disadvantages: the first method has the disadvantage
that it potentially requires generating and screening a huge number of patterns, but it is more
likely to find a MAP solution, and the final model will be composed of only high quality,
pre-screened patterns. The second inference method does not pre-screen the patterns, so it searches
over a much larger space in theory, at the expense of a more difficult computation. In practice
either method can be used, though if the user has a preference for patterns of shorter lengths, the
first method could be substantially faster and lead to better solutions.

Our applied interest is to understand user response to personalized advertisements that are
chosen based on the user, the advertisement, and the context. Such systems are called
context-aware recommender systems (see surveys [1,2,5,47] and references therein). One major
challenge in the design of recommender systems, reviewed in [47], is the interaction challenge: users
typically wish to know why certain recommendations were made. Our work addresses precisely
this challenge: our models provide patterns in data that describe conditions on which a
recommendation will be accepted.

2 Related Work 

The models we are studying have different names in different fields: “disjunctions of
conjunctions” in marketing, “classification patterns” in data mining, and “disjunctive normal forms”
(DNF) in artificial intelligence. Learning logical models of this form has an extensive history.
Valiant [45] showed that DNFs could be learned in polynomial time in the PAC (probably
approximately correct) setting, and recent work has improved those bounds via polynomial
threshold functions [23] and Fourier analysis [14]. However, these theoretical approaches often require
unrealistic modeling assumptions and do not give the user control over interpretability.

In parallel, the data-mining literature has developed approaches to building logical
conjunctive models. Associative classification methods (e.g., [10,11,26,27,31,38,51]) mine for frequent
patterns in the data and combine them to build classifiers, generally in a heuristic way, where
patterns are ranked by an interestingness criterion and the top several patterns are used. Some of
these methods, like CBA, CPAR and CMAR [10,11,26,51], still suffer from a huge number of
patterns and do not yield interpretable classifiers, yet it is well-known that for many domains,
the space of good predictive models is often large enough to include very simple models [20].
Inductive logic programming [33] is similar, in that it mines (potentially complicated) patterns
and takes the simple union of these patterns as the pattern set, rather than optimizing the
pattern set directly. This is a major disadvantage relative to the approach we take here. Another class of
approaches aims to construct DNF models by greedily adding the conjunction that explains the
most of the remaining data [13,16,17,28,37]. Thus, again, these methods do not directly aim
to produce globally optimal conjunctive models. There are a few recent techniques that do aim to
fully learn DNF models [18,19], which present integer programming approaches for solving the
full problems, and also present relaxations for computational efficiency. These are very different
from our work in that the method of Hauser et al. [19] is not generative and also does not have
the advantage of reduction to a smaller problem that we have. The work of Goh and Rudin [18]
is for real-valued features only, whereas we focus mainly on categorical data, though binned or
thresholded real-valued data would suffice where the bins need not be exclusive.

Note that logical models are generally robust to outliers and naturally handle missing data, 
with no imputation needed for missing attribute values. These methods can perform comparably 
with traditional convex optimization-based methods such as support vector machines or lasso 
(though linear models are not always considered to be interpretable in many domains). 

Or’s of and’s models are also a special case of another form of interpretable model called
M-of-N rules [12,15,35,42,44], in particular when M=1. In an M-of-N rules model, an example is
classified as positive if at least M criteria among N are satisfied. If M=1, the model becomes a
disjunction of conditions, and if M=N, then the model is a single conjunction. (In these models,
one rule generally refers to one literal, whereas in our model, each pattern can have multiple
literals.)

The main application we consider is in-vehicle context-aware recommender systems. The
most similar works to ours include that of Baralis et al. [7], who present a framework that
discovers relationships between user context and services using association rules. Lee et al. [24] create
interpretable context-aware recommendations by using a decision tree model that considers
location context, personal context, environmental context and user preferences. However, they did
not study some of the most important factors we include, namely contextual information such as
the user’s destination, relative locations of services along the route, the location of the services
with respect to the user’s destination, passenger(s) in the vehicle, etc. Our work is related to
recommendation systems for in-vehicle context-aware music recommendations, see [6,49], but
whether a user will accept a music recommendation does not depend on anything analogous to
the location of a business that the user would drive to. The setup of in-vehicle recommendation
systems is also different from that of, for instance, mobile-tourism guides [34,40,43,46], where
the user is searching to accept a recommendation, and interacts heavily with the system in
order to find an acceptable recommendation. The closest work to ours is probably that of Park et
al. [36], who also consider Bayesian predictive models for context-aware recommender systems
for restaurants. They also consider demographic and context-based attributes. They did not study
advertising, however, which means they did not consider the distance to the coupon’s venue,
expiration times, etc.

3 Bayesian or’s of and’s 

We work with standard classification data. The data set $S$ consists of
$\{\mathbf{x}_n, y_n\}_{n=1,\dots,N}$, where $y_n \in \{0,1\}$ and $\{\mathbf{x}_n\}_{n=1,\dots,N}$
has $N$ observations and $J$ attributes. $S^+$ is the class of observations
with positive labels, and the observations with negative labels are $S^-$. We use $A$ to represent a
set of patterns and a pattern in $A$ is represented as $a_i$, indexed by $i \in \{1, \dots, |A|\}$.
We define a boolean function $f_r(\mathbf{x}_n, a)$ that evaluates if pattern $a$ applies to data
point $\mathbf{x}_n$. Then we can define a classifier built from $A$ as $f_A$:

$$f_A(\mathbf{x}_n) = \begin{cases} 1 & \exists a \in A,\ f_r(\mathbf{x}_n, a) = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

As long as a data point satisfies at least one of the patterns in $A$, it is classified as positive.

Figure 1 shows an example of a pattern set. Each pattern is a yellow patch that covers a
particular area, and the pattern applies to the area it covers. In Figure 1, the white oval in the
middle indicates the positive class. Our goal is to find a set of patterns $A$ that covers mostly the
positive class, but little of the negative class.



Figure 1: Illustration of or’s of and’s 

We present a probabilistic model for selecting patterns. Taking a Bayesian approach allows us 
to flexibly incorporate users’ expectations on the “shape” of a pattern set through the prior. The
user can guide the model toward more domain-interpretable solutions by specifying a desired 
balance between the size and lengths of patterns. 

3.1 Prior 

We propose two models for the prior. In the BOA-BetaBinomial model, the maximum length 
L of patterns is pre-determined by a user. Patterns of the same length are placed into the same 
pattern pool. The model uses L beta priors to control the probabilities of selecting patterns 
from different pools. In a BOA-Poisson model, the “shape” of a pattern set, which includes 
the number of patterns and lengths of patterns, is decided by drawing from Poisson distributions 
parameterized with user-defined values. Then the generative process fills it in with literals by first 
randomly selecting attributes and then randomly selecting values corresponding to each attribute. 
We present the two prior models in detail. 

3.1.1 Beta-Binomial prior 

Notice that the pattern set we want to find should include patterns that describe only the positive
class and also discriminate it from the negative class. Thus we need only rules from the positive
data $S^+$ and not from $S^-$. We use $\mathcal{A}_S$ to represent a complete set of patterns mined
from $S^+$. $\mathcal{A}_S$ can be further divided into pattern pools indexed by lengths. A pool
containing all patterns of length $l$ is denoted as $\mathcal{A}_S^{[l]}$. In this model,
interpretability of a pattern is determined by its length, so that the a priori probability that a
pattern of length $l$ is selected depends only on $l$. We use a beta prior on the probability $p_l$
for the inclusion of a pattern in $\mathcal{A}_S^{[l]}$ to be in the pattern set $A$:

$$p_l \sim \text{Beta}(\alpha_l, \beta_l). \tag{2}$$

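The parameters $\{\alpha_l, \beta_l > 0 \mid l \in \{1, \dots, L\}\}$ on the priors control the expected
number of patterns of each length in the pattern set. Specifically, let $A^{[l]}$ denote the set of
patterns selected from $\mathcal{A}_S^{[l]}$, then the pattern set is represented by
$A = \bigcup_l A^{[l]}$. Define $M_l = |A^{[l]}|$, we then have
$E[M_l] = |\mathcal{A}_S^{[l]}| \cdot \frac{\alpha_l}{\alpha_l + \beta_l}$. Therefore,
if we favor short patterns, we could simply increase $\frac{\alpha_l}{\alpha_l + \beta_l}$ for smaller
$l$ and decrease the ratio for larger $l$.

The pattern set $A^{[l]}$ is a collection of $M_l$ patterns independently selected from
$\mathcal{A}_S^{[l]}$, so that $A^{[l]} \subset \mathcal{A}_S^{[l]}$. We
integrate out the probability $p_l$ to get the probability of $A^{[l]}$:

$$\begin{aligned}
P(A^{[l]}; \alpha_l, \beta_l) &= \int p_l^{M_l} (1 - p_l)^{|\mathcal{A}_S^{[l]}| - M_l}\, \text{Beta}(p_l; \alpha_l, \beta_l)\, dp_l \\
&= \frac{\Gamma(\alpha_l + \beta_l)}{\Gamma(\alpha_l)\Gamma(\beta_l)} \cdot \frac{\Gamma(M_l + \alpha_l)\,\Gamma(|\mathcal{A}_S^{[l]}| - M_l + \beta_l)}{\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)},
\end{aligned} \tag{3}$$

where the first line follows because each pattern is selected independently and the second line follows
from integrating over the beta prior on $p_l$. Thus the probability of $A$ is:

$$P(A; \theta_{\text{prior}}) = \prod_{l=1}^{L} P(A^{[l]}; \alpha_l, \beta_l), \tag{4}$$

where $\theta_{\text{prior}} = \{\alpha_l, \beta_l\}_{l=1,\dots,L}$. We usually choose
$\alpha_l \ll \beta_l$ so that BOA tends to choose a smaller $M_l$ for each $l$.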
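As a rough numerical illustration of the Beta-Binomial prior, the closed form obtained after integrating out each inclusion probability can be evaluated in log space. The function name and the dictionary-based arguments below are hypothetical conveniences, not part of the paper; `math.lgamma` is used to avoid overflow in the Gamma functions.

```python
from math import lgamma
from typing import Dict, Tuple


def log_beta_binomial_prior(
    selected: Dict[int, int],                # l -> M_l, number of chosen patterns of length l
    pool_sizes: Dict[int, int],              # l -> size of the pool of length-l patterns
    hyper: Dict[int, Tuple[float, float]],   # l -> (alpha_l, beta_l)
) -> float:
    """Log prior of a pattern set: sum over lengths of the integrated Beta-Binomial term."""
    total = 0.0
    for l, pool in pool_sizes.items():
        m = selected.get(l, 0)
        alpha, beta = hyper[l]
        total += (
            lgamma(alpha + beta) - lgamma(alpha) - lgamma(beta)
            + lgamma(m + alpha) + lgamma(pool - m + beta)
            - lgamma(pool + alpha + beta)
        )
    return total
```

With alpha much smaller than beta for a given length, pattern sets that select few patterns of that length receive a higher log prior, which matches the stated intent of the hyperparameters.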

3.1.2 Poisson prior 

We introduce a different prior for the BOA model. Let $M$ denote the total number of patterns in $A$ and
$L_m$ denote the lengths of patterns for $m \in \{1, \dots, M\}$. There can be at least 0 and at most
$|\mathcal{A}_S|$ patterns in a set, and for each pattern, the length needs to be at least 1 and at most
the number of all attributes $J$. We first draw $M$ and $L_m$ from truncated Poisson distributions to
decide the “shape” of a pattern set, then we fill in the patterns with literals. To generate a literal,
we first randomly select the attributes then randomly select values corresponding to each attribute.
We use $v_{m,k}$ to represent the attribute index for the $k$-th literal in the $m$-th pattern,
$v_{m,k} \in \{1, \dots, J\}$. $K_{v_{m,k}}$ is the total number of values for attribute $v_{m,k}$.
The generative process is as follows:

1: Draw the number of patterns: $M \sim \text{Truncated-Poisson}(\lambda_M)$, $M \in \{0, \dots, |\mathcal{A}_S|\}$
2: for $m \in \{1, \dots, M\}$ do
3:   Draw the number of conditions in the $m$-th pattern: $L_m \sim \text{Truncated-Poisson}(\lambda_L)$, $L_m \in \{1, \dots, J\}$
4:   Randomly select $L_m$ attributes from $J$ attributes without replacement
5:   for $k \in \{1, \dots, L_m\}$ do
6:     Uniformly at random select a value from the $K_{v_{m,k}}$ values corresponding to attribute $v_{m,k}$
7:   end for
8: end for
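The generative process above can be sketched as a small sampler. This is an assumption-laden illustration, not the authors' implementation: the names `truncated_poisson` and `sample_pattern_set`, the `values` dictionary, and the `max_patterns` argument (standing in for the upper limit on $M$) are all hypothetical, and the truncated Poisson is drawn by enumerating its finite support.

```python
import math
import random
from typing import Dict, List, Tuple


def truncated_poisson(lam: float, low: int, high: int, rng: random.Random) -> int:
    """Draw from Poisson(lam) restricted to {low, ..., high} (requires lam > 0)."""
    support = list(range(low, high + 1))
    # Unnormalized weights lam^k / k!; the exp(-lam) factor cancels after normalization.
    weights = [math.exp(k * math.log(lam) - math.lgamma(k + 1)) for k in support]
    return rng.choices(support, weights=weights, k=1)[0]


def sample_pattern_set(
    values: Dict[str, List[str]],   # attribute -> list of its possible values
    lam_M: float,
    lam_L: float,
    max_patterns: int,              # assumed upper limit on the number of patterns
    rng: random.Random,
) -> List[List[Tuple[str, str]]]:
    attributes = sorted(values)
    J = len(attributes)
    M = truncated_poisson(lam_M, 0, max_patterns, rng)       # step 1
    pattern_set = []
    for _ in range(M):                                       # step 2
        L_m = truncated_poisson(lam_L, 1, J, rng)            # step 3
        chosen = rng.sample(attributes, L_m)                 # step 4: without replacement
        pattern = [(v, rng.choice(values[v])) for v in chosen]  # steps 5-6
        pattern_set.append(pattern)
    return pattern_set
```

Smaller values of `lam_M` and `lam_L` make the sampler favor fewer and shorter patterns, which is how the prior expresses a preference for compact models.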

We define $\theta_{\text{prior}} = \{\lambda_M, \lambda_L\}$, and normalization constant
$\omega(\lambda_M, \lambda_L)$, thus the probability of generating a pattern set $A$ is

$$P(A; \theta_{\text{prior}}) = \omega(\lambda_M, \lambda_L)\, \text{Poisson}(M; \lambda_M) \prod_{m=1}^{M} \text{Poisson}(L_m; \lambda_L)\, \frac{1}{\binom{J}{L_m}} \prod_{k=1}^{L_m} \frac{1}{K_{v_{m,k}}}. \tag{5}$$

3.2 Likelihood

Let $(\mathbf{x}_n, y_n)$ denote the $n$-th observation, let $f_A(\mathbf{x}_n)$ denote the
classification outcome for $\mathbf{x}_n$, and let $y_n$ denote the observed outcome. Recall that
$f_A(\mathbf{x}_n) = 1$ if $\mathbf{x}_n$ obeys any of the patterns $a \in A$. We introduce
likelihood parameter $\rho_+$ to govern the probability that an observation is a real positive class
case, $y_n = 1$, when it satisfies the pattern set, and $\rho_-$ as the probability that $y_n = 0$
when it does not satisfy the set.

The likelihood of data $S = \{\mathbf{x}_n, y_n\}_n$ given a pattern set $A$ and parameters
$\rho_+, \rho_-$ is thus:

$$P(S \mid A, \rho_+, \rho_-) = \prod_n \rho_+^{f_A(\mathbf{x}_n) y_n} (1 - \rho_+)^{f_A(\mathbf{x}_n)(1 - y_n)} \rho_-^{[1 - f_A(\mathbf{x}_n)](1 - y_n)} (1 - \rho_-)^{[1 - f_A(\mathbf{x}_n)] y_n}, \tag{6}$$

where the four components in formula (6) represent four classification outcomes: true positive, false
positive, true negative, and false negative. We place beta priors over $\rho_+$ and $\rho_-$:

$$\rho_+ \sim \text{Beta}(\alpha_+, \beta_+), \qquad \rho_- \sim \text{Beta}(\alpha_-, \beta_-).$$

Here, $\alpha_+, \beta_+, \alpha_-, \beta_-$ should be chosen such that $E[\rho_+]$ and $E[\rho_-]$ are
close to 1, which means the classification outcomes agree with the observed outcomes. Integrating out
$\rho_+$ and $\rho_-$ from the likelihood in (6), we get


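$$\begin{aligned}
P(S \mid A; \theta_{\text{likelihood}}) &= \int P(S \mid A, \rho_+, \rho_-)\, \text{Beta}(\rho_+; \alpha_+, \beta_+)\, \text{Beta}(\rho_-; \alpha_-, \beta_-)\, d\rho_+\, d\rho_- \\
&= \frac{\Gamma(\alpha_+ + \beta_+)}{\Gamma(\alpha_+)\Gamma(\beta_+)} \cdot \frac{\Gamma\left(\sum_n f_A(\mathbf{x}_n) y_n + \alpha_+\right) \Gamma\left(\sum_n f_A(\mathbf{x}_n)(1 - y_n) + \beta_+\right)}{\Gamma\left(\sum_n f_A(\mathbf{x}_n) + \alpha_+ + \beta_+\right)} \\
&\quad \cdot \frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \cdot \frac{\Gamma\left(\sum_n (1 - f_A(\mathbf{x}_n)) y_n + \beta_-\right) \Gamma\left(\sum_n (1 - f_A(\mathbf{x}_n))(1 - y_n) + \alpha_-\right)}{\Gamma\left(\sum_n (1 - f_A(\mathbf{x}_n)) + \alpha_- + \beta_-\right)},
\end{aligned} \tag{7}$$

where $\theta_{\text{likelihood}} = \{\alpha_+, \beta_+, \alpha_-, \beta_-\}$. The training data can be
divided into four cases: true positives (TP $= \sum_n f_A(\mathbf{x}_n) y_n$), false positives
(FP $= \sum_n f_A(\mathbf{x}_n)(1 - y_n)$), true negatives (TN $= \sum_n (1 - f_A(\mathbf{x}_n))(1 - y_n)$)
and false negatives (FN $= \sum_n (1 - f_A(\mathbf{x}_n)) y_n$). The above likelihood can be rewritten as:

$$P(S \mid A; \theta_{\text{likelihood}}) = \frac{\Gamma(\alpha_+ + \beta_+)}{\Gamma(\alpha_+)\Gamma(\beta_+)} \cdot \frac{\Gamma(\text{TP} + \alpha_+)\,\Gamma(\text{FP} + \beta_+)}{\Gamma(\text{TP} + \text{FP} + \alpha_+ + \beta_+)} \cdot \frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \cdot \frac{\Gamma(\text{TN} + \alpha_-)\,\Gamma(\text{FN} + \beta_-)}{\Gamma(\text{TN} + \text{FN} + \alpha_- + \beta_-)}. \tag{8}$$

We want to maximize the posterior, which is equivalent to maximizing the joint probability of $S$ and $A$:

$$P(S, A; \theta) = P(S \mid A; \theta_{\text{likelihood}})\, P(A; \theta_{\text{prior}}), \tag{9}$$

where $\theta = \{\theta_{\text{prior}}, \theta_{\text{likelihood}}\}$ and $\theta_{\text{prior}}$
depends on which prior model is used.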
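The marginal likelihood written in terms of TP, FP, TN and FN is a product of two Beta-function ratios, so it is convenient to compute it in log space. The sketch below assumes labels and predictions are given as 0/1 sequences; the function names `log_beta`, `log_likelihood` and `energy` are illustrative and not taken from the paper.

```python
from math import lgamma
from typing import Sequence


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_likelihood(
    y: Sequence[int],        # observed labels y_n in {0, 1}
    f: Sequence[int],        # classifier outputs f_A(x_n) in {0, 1}
    alpha_pos: float, beta_pos: float,
    alpha_neg: float, beta_neg: float,
) -> float:
    """Log of the integrated likelihood, computed from the confusion counts."""
    tp = sum(p * t for t, p in zip(y, f))
    fp = sum(p * (1 - t) for t, p in zip(y, f))
    tn = sum((1 - p) * (1 - t) for t, p in zip(y, f))
    fn = sum((1 - p) * t for t, p in zip(y, f))
    return (
        log_beta(tp + alpha_pos, fp + beta_pos) - log_beta(alpha_pos, beta_pos)
        + log_beta(tn + alpha_neg, fn + beta_neg) - log_beta(alpha_neg, beta_neg)
    )


def energy(log_prior: float, log_lik: float) -> float:
    """Negative log joint probability: the quantity minimized when searching for a MAP model."""
    return -(log_prior + log_lik)
```

Each ratio B(count + alpha, count + beta) / B(alpha, beta) equals the corresponding Gamma-function factor in the rewritten likelihood, so the two forms agree term by term.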

4 Approximate MAP Inference 

In this section, we describe a procedure for approximately solving for the maximum a posteriori or MAP
solution to the BOA model. Inference in the BOA model is challenging because finding the best model
involves a search over exponentially many possible sets of patterns: since each pattern is a conjunction of
literals, the number of patterns increases exponentially with the number of literals, and the number of sets
of patterns increases exponentially with the number of patterns. We propose two variations of stochastic
local search algorithms, with different notions of a “neighboring” solution. The first method searches the
space by adding or removing a pattern at every iteration, and uses only pre-mined patterns. The second
method searches the space by adding or removing a literal at every iteration. Both methods use a simulated
annealing approach with moves designed to quickly explore promising solutions.
4.1 Simulated Annealing

Maximizing the posterior is equivalent to minimizing the negative log joint probability:

$$E_S(A; \theta) = -\log P(S, A; \theta). \tag{10}$$

We define $\Lambda_S$ as a complete set of all possible pattern sets:

$$\Lambda_S := \{A \mid A = \{a_1, a_2, \dots, a_{|A|}\} \text{ where } a_i \in \mathcal{A}_S\}. \tag{11}$$

Exhaustive evaluation of $A$ over all of $\Lambda_S$ will not be feasible; if we were to brute force search
for the best classifier out of the whole set of possible classifiers, this would involve evaluating all
possible subsets of patterns on the training data, and for candidate pattern set $\mathcal{A}_S$, there are
$|\Lambda_S| = 2^{|\mathcal{A}_S|}$ such subsets.
Simulated annealing [22] presents an alternative, and is naturally suited to approximate optimization here.
Our simulated annealing steps are similar to the Gibbs sampling steps used by [25,48] for rule-list models.

The search starts by randomly generating a pattern set. Then at each iteration, an example is randomly
selected from the misclassified data points. If the example is positive, it means the current pattern set fails
to cover it and we then find a neighboring solution that covers more data than the current solution, so we
call the action “COVERMORE”. If the example is negative, it means the current pattern set covers the
wrong data so we need to find a neighbor pattern set that covers less, and we call the action
“COVERLESS”. $A^{t+1}$ is generated from the current $A^t$ using one of the two actions. How the action is
carried out on $A^t$ depends on the level where a “neighbor” is defined. In a pattern level search, a change
is made by either adding or removing a pattern; and in a literal level search, a change is made by either
adding or removing a literal. We will elaborate on the actions on the two levels later in this section. To
help avoid local minima, we use a scoring function to evaluate all the neighboring solutions, select the best
solution with probability $1 - p$, and select a random solution with probability $p$. Since the objective is
to minimize $E_S$, $E_S$ naturally becomes the scoring function to evaluate the neighboring solutions.
To summarize:

• With probability $p$, move to a randomly selected neighboring position,

• Otherwise, move from $A^t$ to the neighboring position with minimum $E_S(A^{t+1}; \theta)$.

Then the proposal is accepted with probability
$\min\left\{1, \exp\left(-\frac{E_S(A^{t+1}; \theta) - E_S(A^t; \theta)}{T(t)}\right)\right\}$, where $T(t)$
is the temperature, which follows a cooling schedule.

We repeat the search three times, from three random starting points, and we select the solution with
the highest posterior probability. We present the general search in Algorithm 1, where the user can choose
to do a pattern level search (see 4.2) or a literal level search (see 4.3).
Algorithm 1 Simulated Annealing for BOA

 1: procedure SLSEARCH(maxSteps, $p$, level)
 2:   $A^t \leftarrow$ a randomly generated pattern set
 3:   step $\leftarrow 0$
 4:   $S_e \leftarrow$ examples misclassified by $A^t$
 5:   while step < maxSteps and $S_e \neq \emptyset$ do
 6:     step $\leftarrow$ step + 1
 7:     $S_e \leftarrow$ misclassified examples by $A^t$
 8:     ex $\leftarrow$ a random example drawn from $S_e$
 9:     if ex is a positive example then
10:       $A^{t+1} \leftarrow$ COVERMORE(level, $A^t$, $p$)
11:     else
12:       $A^{t+1} \leftarrow$ COVERLESS(level, $A^t$, $p$)
13:     end if
14:     $A_{\max} \leftarrow \arg\max_{A \in \{A_{\max}, A^{t+1}\}} P(S, A; \theta)$
15:     $\alpha = \min\left\{1, \exp\left(-\frac{E_S(A^{t+1}; \theta) - E_S(A^t; \theta)}{T(t)}\right)\right\}$
16:     $A^{t+1} \leftarrow A^{t+1}$ with probability $\alpha$, or $A^t$ with probability $1 - \alpha$
17:   end while
18:   return $A_{\max}$
19: end procedure
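A compact way to read Algorithm 1 is as a generic loop over user-supplied callbacks. The sketch below is one possible rendering under several assumptions: the function `anneal` and all of its parameter names are hypothetical, the temperature schedule is passed in by the caller because its exact form is not reproduced here, and the acceptance rule is the standard Metropolis rule for a minimization problem.

```python
import math
import random
from typing import Callable, List, TypeVar

State = TypeVar("State")


def anneal(
    initial: State,
    energy: Callable[[State], float],             # E_S(A; theta), lower is better
    misclassified: Callable[[State], List[int]],  # true labels of misclassified examples
    cover_more: Callable[[State], State],         # proposal when a positive is missed
    cover_less: Callable[[State], State],         # proposal when a negative is covered
    max_steps: int,
    temperature: Callable[[int], float],          # assumed schedule, must stay positive
    rng: random.Random,
) -> State:
    current = best = initial
    for step in range(1, max_steps + 1):
        errors = misclassified(current)
        if not errors:
            break                                  # nothing left to fix
        label = rng.choice(errors)                 # pick one misclassified example
        proposal = cover_more(current) if label == 1 else cover_less(current)
        if energy(proposal) < energy(best):
            best = proposal                        # remember the best state seen so far
        delta = energy(proposal) - energy(current)
        accept = 1.0 if delta <= 0 else math.exp(-delta / temperature(step))
        if rng.random() < accept:
            current = proposal
    return best
```

Keeping the best state separately from the current one means the returned model is never worse than the starting point, even when the chain accepts uphill moves.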


4.2 Pattern Level Stochastic Search 

For stochastic search over patterns, neighboring solutions are “one-pattern-different” from the current set.
The simulation chain moves to these positions by adding or removing a pattern from the current one.
Therefore the two actions are the following:

• COVERMORE(“pattern”, $A^t$, $p$)

- With probability $p$, add a random pattern to $A^t$.

- Else, evaluate the objective $E_S$ for all neighboring solutions where a pattern is added to $A^t$
and choose the one with the minimum score.

• COVERLESS(“pattern”, $A^t$, $p$)

- With probability $p$, remove a random pattern from $A^t$.

- Else, evaluate the objective $E_S$ for all neighboring solutions where a pattern is removed from
$A^t$ and choose the one with the minimum score.
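The two pattern-level actions can be sketched as follows. The function names, the `candidates` argument (standing in for the pre-mined patterns) and the `energy` callback are illustrative assumptions; patterns may be any values that support equality comparison.

```python
import random
from typing import Callable, List, Sequence, TypeVar

P = TypeVar("P")  # any pattern representation that supports equality


def cover_more_pattern(
    A: List[P], candidates: Sequence[P],
    energy: Callable[[List[P]], float], p: float, rng: random.Random,
) -> List[P]:
    """Add one pattern: a random one with probability p, else the best by energy."""
    unused = [a for a in candidates if a not in A]
    if not unused:
        return list(A)                     # nothing left to add
    if rng.random() < p:
        return A + [rng.choice(unused)]
    return min((A + [a] for a in unused), key=energy)


def cover_less_pattern(
    A: List[P], energy: Callable[[List[P]], float], p: float, rng: random.Random,
) -> List[P]:
    """Remove one pattern: a random one with probability p, else the best by energy."""
    if not A:
        return []                          # nothing to remove
    if rng.random() < p:
        i = rng.randrange(len(A))
        return A[:i] + A[i + 1:]
    return min((A[:i] + A[i + 1:] for i in range(len(A))), key=energy)
```

The greedy branch evaluates the objective once per neighbor, so its cost grows with the number of candidate patterns; this is one reason the candidate pool is screened beforehand.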
We note that any reasonably accurate sparse classifier should contain largely accurate patterns. Rather
than considering all patterns (exponential in the number of attributes), we use only the pre-mined patterns.
To efficiently search for the MAP solution, we require a minimum support to limit the number of patterns
that are generated. This will greatly reduce the computational complexity and we will show in Section 5
that filtering out these patterns does not affect the MAP joint probability.

We consider both positive associations (e.g., $x_j$=‘blue’) and negative associations ($x_j$=‘not green’) as
literals. (The importance of negative literals is stressed, for instance, by [9,41,50].) We then mine for
frequent patterns within the set of positive observations $S^+$. To do this, we use the FP-growth algorithm
[8], which can in practice be replaced with any desired frequent pattern-mining method. Even when we
restrict the length of patterns and the minimum support, the number of generated patterns could still be
too large to handle. (For example, a million patterns are generated for one of the advertisement datasets
when the minimum support is 5% and the maximum length is 3.) Therefore, we wish to use a second criterion
to screen for the most potentially useful $M_0$ patterns. We first filter out patterns on the lower right
plane of ROC space, i.e., those whose false positive rate is greater than their true positive rate. Then we
use information gain to screen patterns, similarly to other works [10,11]. For a pattern $a$, the information
gain is $\text{InfoGain}(S|a) = H(S) - H(S, a)$, where $H(S)$ is the entropy of the data and $H(S, a)$ is the
entropy of the data split on pattern $a$. Given a dataset $S$, entropy $H(S)$ is constant; therefore our
screening technique chooses the $M_0$ patterns that have the smallest $H(S, a)$, where $M_0$ is user-defined.

Figure 2: All patterns and selected patterns on a ROC plane

We illustrate the effect of screening on one of our advertisement data sets. We mined all patterns with
minimum support 5% and maximum length 3. For each pattern, we computed its true positive rate and false
positive rate on the training data, and plotted it as a dot in Figure 2. The top 5000 patterns with highest
information gains are colored in red, and the rest are in blue. As shown in the figure, information gain
indeed selected good patterns as they are closer to the upper left corner in ROC space. For many
applications, this screening technique is not needed, and we can simply use the entire set of pre-mined rules.
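The two-stage screening can be illustrated with a short sketch. It assumes each pattern is summarized by a boolean coverage vector over the training data and that both classes are present; the entropy of the data split on a pattern is computed here as the label entropy of the covered and uncovered parts, weighted by their sizes. The names `entropy`, `split_entropy` and `screen` are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Entropy in bits of a 0/1 label sequence (0 for an empty sequence)."""
    if not labels:
        return 0.0
    q = sum(labels) / len(labels)
    return -sum(r * math.log2(r) for r in (q, 1 - q) if r > 0)


def split_entropy(labels: Sequence[int], covered: Sequence[bool]) -> float:
    """Size-weighted label entropy of the covered and uncovered parts of the data."""
    inside = [y for y, c in zip(labels, covered) if c]
    outside = [y for y, c in zip(labels, covered) if not c]
    n = len(labels)
    return len(inside) / n * entropy(inside) + len(outside) / n * entropy(outside)


def screen(cover: Dict[str, List[bool]], labels: Sequence[int], m0: int) -> List[str]:
    """Drop patterns with FPR > TPR, then keep the m0 patterns with smallest split entropy."""
    n_pos = sum(labels)                  # assumed > 0
    n_neg = len(labels) - n_pos          # assumed > 0
    kept = []
    for name, covered in cover.items():
        tpr = sum(1 for y, c in zip(labels, covered) if c and y == 1) / n_pos
        fpr = sum(1 for y, c in zip(labels, covered) if c and y == 0) / n_neg
        if fpr <= tpr:
            kept.append(name)
    kept.sort(key=lambda name: split_entropy(labels, cover[name]))
    return kept[:m0]
```

Because the entropy of the full data set is the same for every pattern, ranking by smallest split entropy gives the same order as ranking by largest information gain.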

4.3 Literal Level Stochastic Search 

For search on literal level, a neighboring solution is generated by either adding or removing a literal from 
the current pattern set. If the example is a positive case, it means the current pattern set fails to cover it. 
In order to cover this example, we either remove a literal from a pattern or add a literal as a new pattern. 
Both options increase the support of the pattern set to cover more data points. If the example is a negative 
case, it means the current pattern set covers the wrong data so we need to either remove an existing pattern 
or add a condition to a pattern to make it not cover the example. 

• COVERMORE(“literal”, $A^t$, $p$)

- With probability 0.5, do as follows: with probability $p$, remove a random literal from a random
pattern; else, evaluate the objective $E_S$ for all neighboring solutions where a literal is removed
and choose the one with the minimum score.

- With probability 0.5, do as follows: with probability $p$, add a random literal as a new pattern;
else, evaluate the objective $E_S$ for all neighboring solutions where a literal is added as a new
pattern, and choose the one with the minimum score.

• COVERLESS(“literal”, $A^t$, $p$)

- With probability 0.5, do as follows: with probability $p$, add a random literal to a random
pattern; else, evaluate the objective $E_S$ for all neighboring solutions where a literal is added
to a pattern in the set and choose the one with the minimum score.

- With probability 0.5, do as follows: with probability $p$, remove a random pattern; else,
evaluate the objective $E_S$ for all neighboring solutions where a pattern is removed from the current
set and choose the one with the minimum score.
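The neighborhoods used by the literal-level actions can be enumerated explicitly, as in the sketch below. Two choices here are assumptions of this illustration rather than statements from the text: a pattern with a single literal is never shrunk to an empty conjunction, and a literal is only added to a pattern that does not already constrain the same attribute. The function names are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Literal = Tuple[str, str]       # (attribute name, value)
Pattern = FrozenSet[Literal]    # conjunction of literals


def cover_more_neighbors(A: List[Pattern], literals: List[Literal]) -> List[List[Pattern]]:
    """Neighbors that cover more data: drop a literal, or add a one-literal pattern."""
    out = []
    for i, a in enumerate(A):
        if len(a) > 1:  # assumption: do not create an empty (always-true) pattern
            for lit in sorted(a):
                out.append(A[:i] + [a - {lit}] + A[i + 1:])
    for lit in literals:
        new = frozenset({lit})
        if new not in A:
            out.append(A + [new])
    return out


def cover_less_neighbors(A: List[Pattern], literals: List[Literal]) -> List[List[Pattern]]:
    """Neighbors that cover less data: add a literal to a pattern, or drop a pattern."""
    out = []
    for i, a in enumerate(A):
        used = {attr for attr, _ in a}
        for lit in literals:
            if lit[0] not in used:  # assumption: one literal per attribute in a pattern
                out.append(A[:i] + [a | {lit}] + A[i + 1:])
    for i in range(len(A)):
        out.append(A[:i] + A[i + 1:])
    return out
```

Given these lists, the random branch of each action picks one neighbor uniformly, while the greedy branch takes the neighbor with the smallest objective value.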

The second action in COVERLESS removes a pattern. Intuitively it might make more sense to remove
one literal at a time rather than removing a pattern, but actually, removing a whole pattern is often a more
gradual change to the model than removing a single literal. A large pattern (consisting of several
constraints on the input variables) affects fewer points than a small pattern (consisting of a single constraint).
Thus, removing a whole pattern can often be a more “local” move than removing a single literal.

This algorithm has some advantages over pattern level stochastic search, in that it does not need to 
pre-define a maximum length of patterns and does not need to generate patterns beforehand. This allows 
us to search the possibly huge space of patterns without enumerating all of them. 


5 Guarantees on the MAP BOA Models from the Priors 


We will show that the inclusion of the prior (which was designed to help with interpretability) provably
assists with both computation and generalization performance. Recall the objective defined in Equation
(10): $E_S(A; \theta) = -\log P(S, A; \theta)$. The goal is to find a pattern set $A^*$ that minimizes $E_S$,
which is equivalent to finding a MAP solution. We show that depending on the prior parameters and on the size of
the data, we have deterministic (i) upper bounds on the sizes of MAP BOA models, and (ii) lower bounds
on the support of rules we need to construct MAP BOA models. This means that practically, we need only
mine and optimize over rules of a certain support level to obtain a MAP solution, which exponentially
reduces the size of the BOA model computation. These bounds also directly lead to better generalization
bounds on predictive performance. The proofs of all theorems in this section are in Appendix A.

We first provide an upper bound for the size of BOA models that depends on the priors. That is, when 
the priors are chosen to favor smaller models, we can place an explicit guarantee on the size of the MAP 
BOA model. This bound depends explicitly on the prior parameters, the number of observations and the 
number of attributes. The reason that the size of a MAP solution is bounded is that the prior places a 
penalty large enough so that the likelihood cannot overwhelm it. We first study the BOA-Poisson model. 

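Theorem 1. Take a BOA-Poisson model with parameters
$\theta = \{\alpha_+, \beta_+, \alpha_-, \beta_-, \lambda_M, \lambda_L\}$, where
$\alpha_+, \beta_+, \alpha_-, \beta_-, \lambda_M, \lambda_L \in \mathbb{N}^+$. The dataset $S$ has $J$
attributes, where the $j$-th attribute has $K_j$ values for $j \in \{1, \dots, J\}$. Define
$A^* \in \arg\min_A E_S(A)$ and $M^* := |A^*|$. If [...], then

$$M^* \le \lambda_M + \frac{\log\left(\frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \cdot \frac{\Gamma(|S^-| + \alpha_-)\,\Gamma(|S^+| + \beta_-)}{\Gamma(|S| + \alpha_- + \beta_-)}\right)}{\log\left(\frac{[\ldots]}{\Gamma(J+1)(\lambda_M + 1)}\right)}.$$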
As we show in the proof,
$\frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \cdot \frac{\Gamma(|S^-| + \alpha_-)\,\Gamma(|S^+| + \beta_-)}{\Gamma(|S| + \alpha_- + \beta_-)}$
is the likelihood of data given an empty set,
i.e., $P(S \mid \emptyset; \theta)$. As $P(S \mid \emptyset; \theta)$ increases, the bound becomes smaller,
which means the model’s maximum possible size is smaller. Intuitively, if an empty set already achieves high
likelihood, adding rules will often hurt the prior term and achieve little gain on the likelihood. Assuming
the ratio of $|S^+|$ to $|S^-|$ stays the same, as the amount of data increases, the term
$P(S \mid \emptyset; \theta)$ becomes smaller and $\log P(S \mid \emptyset; \theta)$ becomes more negative.
Since the denominator is negative and other terms stay the same, the bound becomes larger. The data
overwhelm the prior on the size of the model.

We have a similar upper bound for the size of a MAP BOA-BetaBinomial pattern set. 

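Theorem 2. Take a BOA-BetaBinomial model with parameters
$\theta = \{L, \alpha_+, \beta_+, \alpha_-, \beta_-, \{\alpha_l, \beta_l\}_{l=1,\dots,L}\}$, where
$L, \alpha_+, \beta_+, \alpha_-, \beta_-, \{\alpha_l, \beta_l\}_{l=1,\dots,L} \in \mathbb{N}^+$. Define
$A^* \in \arg\min_A E_S(A)$, and $M^* := |A^*|$. Whenever $\alpha_l < \beta_l$, we have:

$$M^* \le \frac{\log\left(\frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \cdot \frac{\Gamma(|S^-| + \alpha_-)\,\Gamma(|S^+| + \beta_-)}{\Gamma(|S| + \alpha_- + \beta_-)}\right)}{\log\left([\ldots]\right)}.$$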

Similar intuition holds as for the BOA-Poisson model: the dependenee on the number of observations 
is the same for the two bounds. Additionally, when ai is set to be small and fii is set to be large, the bound 
is smaller so we are guaranteed to ehoose a smaller number of rules overall. This is eonsistent with users’ 
expeetation when they set to be small, as explained in Seetion 3.1.1. 

In the BOA-BetaBinomial model, we pre-mine patterns to filter out those with small support. Let us
set information gain aside, and discuss the implications of filtering rules by support alone. This is equivalent
to the statistical assumption that the set of pre-mined patterns is sufficient to produce a model with posterior
approximately the same as the MAP model. We will show something stronger than this. Namely, we
show that finding a globally MAP pattern list (among all possible pattern lists) is equivalent, under weak
conditions, to finding the MAP pattern list among only the pre-mined patterns. This yields a major computational
benefit, as it tells us that we need only mine and optimize over rules of a certain minimum support in order
to find the MAP solution. This eliminates an enormous number of rules and decreases the search space
substantially. The stronger the prior, the more tractable the computation for this model.

We define a subset of $\mathcal{A}_S$ containing patterns of support at least $C$ as:

$$\mathcal{A}_S^C := \{a \in \mathcal{A}_S : \mathrm{supp}_S(a) \ge C\}.$$

$\mathcal{A}_S^C$ is fully enumerated in the rule mining step. Ideally we would like to optimize over elements of $\mathcal{A}_S$,
but it is only practical to optimize over $\mathcal{A}_S^C$. We will thus prove that it is sufficient to do this in order to
find a MAP solution; the MAP solution has only patterns that come from $\mathcal{A}_S^C$. We define $\Lambda_S^C$ as the set of
all possible BOA models with support at least $C$ for each pattern:

$$\Lambda_S^C := \{A \mid A = \{a_1, a_2, \dots, a_{|A|}\} \text{ where } a_i \in \mathcal{A}_S^C\}.$$

If $C = 0$, $\Lambda_S^C$ is the set of all possible BOA models, $\Lambda_S$ as defined in (11).
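As a small illustration of the set of patterns with support at least C, the sketch below enumerates conjunctions of literals up to a maximum length by brute force and keeps those satisfied by at least `min_support` observations. This is only a toy stand-in for the association rule mining step (a real run would use a dedicated frequent-pattern miner); the function name, the data layout (each row is a set of `(attribute, value)` literals) and the example rows are assumptions made for the example.

```python
from itertools import combinations

def mine_patterns(rows, min_support, max_len):
    """Return {pattern: support} for every conjunction of at most max_len
    literals that is satisfied by at least min_support rows."""
    literals = sorted({lit for row in rows for lit in row})
    mined = {}
    for length in range(1, max_len + 1):
        for pattern in combinations(literals, length):
            # A row satisfies the pattern when it contains all of its literals.
            support = sum(1 for row in rows if row.issuperset(pattern))
            if support >= min_support:
                mined[pattern] = support
    return mined

# Hypothetical toy data with two attributes.
rows = [
    {("color", "blue"), ("size", "middle")},
    {("color", "blue"), ("size", "small")},
    {("color", "yellow"), ("size", "middle")},
    {("color", "blue"), ("size", "middle")},
]
mined = mine_patterns(rows, min_support=2, max_len=2)
for pattern, support in mined.items():
    print(pattern, support)
```

Raising `min_support` shrinks the mined set, which is the source of the computational savings discussed above.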
Theorem 3. Take a BOA-BetaBinomial model with parameters
$\theta = \{L, \alpha_+, \beta_+, \alpha_-, \beta_-, \{\alpha_l, \beta_l\}_{l=1,\dots,L}\}$,
where $L, \lambda_M, \lambda_L, \alpha_+, \beta_+, \alpha_-, \beta_-, \alpha_l, \beta_l \in \mathbb{N}^+$. When
$\frac{|S^+|+\alpha_++\beta_+-1}{|S^+|+\alpha_+-1} \cdot \frac{\beta_-}{|S^-|+\alpha_-+\beta_-} < 1$, $\alpha_l < \beta_l$ for
$l = 1, 2, \dots, L$, and

$$C \le \frac{\log \min_l \left( \dfrac{|\mathcal{A}_S^{[l]}| - m_l + \beta_l}{m_l - 1 + \alpha_l} \right)}{\log\left( \dfrac{|S^+|+\alpha_+-1}{|S^+|+\alpha_++\beta_+-1} \cdot \dfrac{|S^-|+\alpha_-+\beta_-}{\beta_-} \right)},$$

where $m_l$ is the upper bound computed in Theorem 2, then

$$\arg\min_{A \in \Lambda_S} E_S(A) \subseteq \arg\min_{A \in \Lambda_S^C} E_S(A).$$

This theorem states that the quantity $\arg\min_{A \in \Lambda_S} E_S(A)$ on the left (which is computationally impossible
to compute in practice) has the same set of minimizers as the quantity $\arg\min_{A \in \Lambda_S^C} E_S(A)$ on the right (which
is what we would compute in practice). It states that we need only mine patterns of support at least
$C$. If we solve for a minimizer of $E_S$ on these mined patterns, it is a global minimizer of $E_S$ across all
pattern sets. This provides strong computational motivation for pre-mining patterns, since there are now
weak conditions under which the "approximation" of using pre-mined patterns is not an approximation at
all. We have a similar bound on the support for BOA-Poisson models.
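The sketch below shows how a support threshold of this kind could be evaluated numerically. It assumes the per-length size bound $m_l = \log P(S|\emptyset;\theta) / \log\frac{|\mathcal{A}_S^{[l]}|+\alpha_l-1}{|\mathcal{A}_S^{[l]}|+\beta_l-1}$ from Theorem 2 and the threshold $\log\min_l\frac{|\mathcal{A}_S^{[l]}|-m_l+\beta_l}{m_l-1+\alpha_l}$ divided by $\log\big(\frac{|S^+|+\alpha_+-1}{|S^+|+\alpha_++\beta_+-1}\cdot\frac{|S^-|+\alpha_-+\beta_-}{\beta_-}\big)$. Function names, the `pools` argument layout and all numeric values are hypothetical.

```python
import math

def log_lik_empty(n_pos, n_neg, a_neg, b_neg):
    """log-likelihood of the data under the empty pattern set."""
    n = n_pos + n_neg
    return (math.lgamma(a_neg + b_neg) - math.lgamma(a_neg) - math.lgamma(b_neg)
            + math.lgamma(n_neg + a_neg) + math.lgamma(n_pos + b_neg)
            - math.lgamma(n + a_neg + b_neg))

def support_threshold(n_pos, n_neg, a_pos, b_pos, a_neg, b_neg, pools):
    """pools: list of (pool_size, alpha_l, beta_l), one entry per pattern length."""
    shrink = ((n_pos + a_pos + b_pos - 1) / (n_pos + a_pos - 1)
              * b_neg / (n_neg + a_neg + b_neg))
    if not shrink < 1:
        raise ValueError("the likelihood ratio condition (< 1) is violated")
    log_p0 = log_lik_empty(n_pos, n_neg, a_neg, b_neg)
    ratios = []
    for pool_size, a_l, b_l in pools:
        if not a_l < b_l:
            raise ValueError("requires alpha_l < beta_l")
        # Per-length size bound m_l (assumed form of the Theorem 2 bound).
        m_l = log_p0 / math.log((pool_size + a_l - 1) / (pool_size + b_l - 1))
        ratios.append((pool_size - m_l + b_l) / (m_l - 1 + a_l))
    if min(ratios) <= 0:
        raise ValueError("size bound exceeds the pool; threshold undefined")
    return math.log(min(ratios)) / math.log(1.0 / shrink)

# Hypothetical example: one pattern length, 1000 candidate patterns.
weak = support_threshold(50, 50, 10, 1, 10, 1, pools=[(1000, 1, 1000)])
strong = support_threshold(50, 50, 10, 1, 10, 1, pools=[(1000, 1, 100000)])
print(weak, strong)
```

In this example the stronger prior (larger $\beta_l$) yields a larger threshold, so more low-support patterns can be discarded before optimization.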


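Theorem 4. Take a BOA-Poisson model with parameters $\theta = \{\alpha_+, \beta_+, \alpha_-, \beta_-, \lambda_M, \lambda_L\}$, where $\alpha_+, \beta_+,
\alpha_-, \beta_-, \lambda_M, \lambda_L \in \mathbb{N}^+$. When
$\frac{|S^+|+\alpha_++\beta_+-1}{|S^+|+\alpha_+-1} \cdot \frac{\beta_-}{|S^-|+\alpha_-+\beta_-} < 1$, and

$$C \le \frac{\log\left( \dfrac{\Gamma(J+1)}{\lambda_M e^{-\lambda_L} \max\{(\lambda_L/2)\Gamma(J),\ (\lambda_L/2)^J\}} \right)}{\log\left( \dfrac{|S^+|+\alpha_+-1}{|S^+|+\alpha_++\beta_+-1} \cdot \dfrac{|S^-|+\alpha_-+\beta_-}{\beta_-} \right)},$$

then

$$\arg\min_{A \in \Lambda_S} E_S(A) \subseteq \arg\min_{A \in \Lambda_S^C} E_S(A).$$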

So far we have focused on properties of the a posteriori model. We now turn to generalization
performance. The following result is algorithm independent. The true risk of a pattern set $A$ is
defined as:

$$R^{\mathrm{true}}(A) = \mathbb{E}_{(X,Y) \sim D}\, \mathbb{1}\left[ f_A(X) \ne Y \right],$$

where classifier $f_A(X)$ equals one when $X$ obeys one of the patterns in $A$, and $f_A(X)$ equals zero
otherwise. The true risk is the standard expected misclassification error.
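The classifier $f_A$ and the empirical counterpart of the risk can be sketched in a few lines. The pattern set below is the example from the abstract; the representation of a pattern as a list of `(attribute, value)` literals, the function names and the three data points are assumptions chosen for illustration.

```python
def predict(pattern_set, x):
    """f_A(x): 1 if x satisfies every literal of at least one pattern, else 0."""
    return int(any(all(x.get(attr) == val for attr, val in pattern)
                   for pattern in pattern_set))

def empirical_risk(pattern_set, data):
    """Fraction of (x, y) pairs that the pattern set misclassifies."""
    errors = sum(predict(pattern_set, x) != y for x, y in data)
    return errors / len(data)

# (x1 = blue AND x3 = middle) OR (x1 = blue AND x2 = <15) OR (x1 = yellow)
pattern_set = [
    [("x1", "blue"), ("x3", "middle")],
    [("x1", "blue"), ("x2", "<15")],
    [("x1", "yellow")],
]

# Hypothetical observations with labels.
data = [
    ({"x1": "blue", "x2": "<15", "x3": "low"}, 1),
    ({"x1": "red", "x2": "<15", "x3": "middle"}, 0),
    ({"x1": "yellow", "x2": ">=15", "x3": "low"}, 0),
]
print([predict(pattern_set, x) for x, _ in data])
print(empirical_risk(pattern_set, data))
```

The empirical risk averages the same 0/1 loss over a finite sample that the true risk takes in expectation over the distribution $D$.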

Theorem 5. Consider the set of BOA models parameterized by $\{p_+, p_-\}$, $p_+, p_- < 1/2$,
where the number of patterns of each model $A$ in the set obeys $|A| \le M_{\mathrm{upper}}$. With probability $1 - \delta$, for all $A$ in this set,


^ — + V- W -■ 

This theorem states that the true risk is upper bounded by the empirical risk of data $S$, in particular the
likelihood, and a complexity term. The complexity term increases with $\prod_{j}(K_j + 1)$, which is the number
of all patterns. The value of $M_{\mathrm{upper}}$ can come from either Theorem 1 or Theorem 2.


6 Simulation Studies 

In this section, we present simulation studies to show that if data are generated from a fixed pattern set, our
simulated annealing procedure can recover it with high probability. We also provide convergence analysis
on simulated data sets to show that our model can achieve the MAP solution in a relatively short time.

6.1 Performance variation with different parameters 

Given observations $\{x_n\}_{n=1,\dots,N}$, assume there exists a true pattern set comprised of $m$ patterns that
classifies $x_n$ and generates the outcome $y_n$. We want to show that simulated annealing can discover this
pattern set efficiently. Let there be a collection of patterns $\{a_j\}_{j=1,\dots,M}$ mined from the observations;
we can construct a binary matrix where the entry on the $n$-th row and $j$-th column is the Boolean function
$h(x_n, a_j)$, which represents whether the $n$-th observation satisfies the $j$-th pattern. We need only simulate this
binary matrix to represent the observations without loss of generality. Each entry is set to 1 independently
with probability 0.1. Here are the most important variables in this simulation study:

• M: the number of candidate patterns

• m: the number of patterns in a true pattern set

• N: the number of observations in a data set

The binary matrix representing the data set has size N × M. We assume all patterns have the same length
so we can ignore the priors. In our experiments, a true pattern set was generated by randomly selecting m
patterns to form a pattern set. We used edit distance between the true pattern set and a generated pattern
set as the performance measure. For each number of iterations from 5000, 10000 and 20000, we repeated
experiments in the simulation 100 times. For each recovery problem, we then used simulated annealing as
described in Section 4.1 with three different starting points. We reported the mean performance in Table 1.
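The simulated data described above can be generated along the following lines. This is a minimal sketch using only the standard library: the function names are hypothetical, labels are taken to be deterministic (an observation is positive exactly when it satisfies at least one true pattern), and edit distance is assumed here to mean the number of patterns that must be added or removed to turn one pattern set into the other.

```python
import random

def simulate(n_obs, n_candidates, n_true, p=0.1, seed=0):
    """Draw an N x M binary matrix h(x_n, a_j), a random true pattern set of
    column indices, and the labels that the true set produces."""
    rng = random.Random(seed)
    matrix = [[1 if rng.random() < p else 0 for _ in range(n_candidates)]
              for _ in range(n_obs)]
    true_set = set(rng.sample(range(n_candidates), n_true))
    labels = [int(any(row[j] for j in true_set)) for row in matrix]
    return matrix, true_set, labels

def edit_distance(set_a, set_b):
    """Size of the symmetric difference between two pattern sets."""
    return len(set_a ^ set_b)

matrix, true_set, labels = simulate(n_obs=200, n_candidates=50, n_true=5)
print(sorted(true_set), sum(labels))
print(edit_distance(true_set, set()))  # distance from the empty model
```

A recovered pattern set with edit distance 0 matches the true set exactly.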

Performance with size of data set, N. In the first study, we set m = 5, M = 1000, and chose
the sample size N from {100, 500, 1000, 2000, 3000, 4000}. The edit distance was computed for each
of the 100 replicates and the means are reported in Table 1. Our results show that as the number of
iterations increases, the true pattern sets are recovered with higher probability. However, the number of
observations N did not have a large influence on the result for N approximately greater than 500. The
accuracy at N = 500 is similar to the accuracy at N = 4000. This result is quite intuitive since simulated
annealing searches over the pattern space, and likely finds the same solution once N is sufficiently large.

Performance with size of pattern space, M. In the second study, we set m = 5, N = 2000,
and chose the pattern size M from {100, 200, 500, 1000, 2000}. We repeated the above procedure and
reported the mean over 100 replicates in Table 1. The number of possible pattern sets of M patterns is
$O(2^M)$; therefore as M increases, searching the space becomes difficult for simulated annealing. (This
does not mean, however, that prediction performance will suffer; as we increase the number of iterations,
the mean edit distance decreases.) We can compensate for larger M by running the simulation for longer
times in order to recover the underlying pattern.

Performance with size of true pattern set, m. In the third study, we set N = 2000, M = 1000
and chose the size of the true pattern set m within {1, 2, 4, 6, 8}. Table 1 shows that as the number of patterns
increases, it becomes harder for the model to recover the true pattern set; however, performance improves
over simulated annealing iterations.

                                  # of iterations
    Setting                     5000     10000    20000

    M = 1000, m = 5
        N = 100                 3.13     2.65     2.31
        N = 500                 1.26     0.51     0.05
        N = 1000                1.43     0.62     0.05
        N = 2000                1.37     0.47     0.06
        N = 3000                1.37     0.41     0.07
        N = 4000                1.43     0.61     0.04

    N = 2000, m = 5
        M = 200                 0.00     0.00     0.00
        M = 500                 0.74     1.16     0.00
        M = 1000                1.37     0.47     0.06
        M = 2000                2.28     1.47     0.84

    N = 2000, M = 1000
        m = 1                   0.01     0.00     0.00
        m = 2                   0.23     0.08     0.00
        m = 4                   0.81     0.38     0.02
        m = 6                   2.09     0.84     0.12
        m = 8                   4.72     4.49     1.8

Table 1: Mean edit distances to true pattern sets with different N, M and m.


6.2 Runtime analysis 

We show how efficiently the performance improves as our algorithm runs. We set the size of the data
N to be 2000 and the size of the true pattern set m to be 5. We then ran simulated annealing and recorded
the output at steps 100, 500, 1000, 2000, 5000, 10000 and 20000. We repeated this procedure 100 times
and plotted the mean and variance of edit distances to true pattern sets in Figure 3, along with running
times in seconds. The algorithm is also very fast: running times were less than one minute, even for
20000 iterations.
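For readers who want to reproduce the flavor of this experiment, the following is a generic simulated annealing loop over subsets of candidate patterns. It is not the procedure of Section 4.1: the proposal (toggle one random pattern), the cooling schedule, and the objective (number of misclassified observations, with no prior term) are all simplifying assumptions, and the function names are hypothetical.

```python
import math
import random

def n_errors(matrix, labels, chosen):
    """Misclassifications when predicting 1 iff any chosen pattern is satisfied."""
    return sum(int(any(row[j] for j in chosen)) != y
               for row, y in zip(matrix, labels))

def anneal(matrix, labels, n_iter, seed=0, t0=1.0):
    rng = random.Random(seed)
    n_candidates = len(matrix[0])
    current = set()
    cur_err = n_errors(matrix, labels, current)
    best, best_err = set(current), cur_err
    trace = [best_err]                      # best objective after each step
    for step in range(1, n_iter + 1):
        temperature = t0 / math.log(1 + step)
        # Propose toggling one randomly chosen candidate pattern in or out.
        proposal = current ^ {rng.randrange(n_candidates)}
        prop_err = n_errors(matrix, labels, proposal)
        accept = (prop_err <= cur_err or
                  rng.random() < math.exp((cur_err - prop_err) / temperature))
        if accept:
            current, cur_err = proposal, prop_err
            if cur_err < best_err:
                best, best_err = set(current), cur_err
        trace.append(best_err)
    return best, best_err, trace
```

Recording `trace` at selected steps gives a convergence curve analogous to the one described in this section, with the caveat that it tracks training error rather than edit distance to a known true pattern set.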


7 Experiments 

We test our model on mobile advertisement datasets that we collected as well as other publicly available 
datasets. In situations where the ground truth consists of deterministic rules (similarly to the simulation 
study), our method tends to perform better than other popular machine learning techniques.

Figure 3: Convergence of mean edit distance and running time with number of iterations.

7.1 Experiments on Mobile Advertisement Datasets 

For this experiment, our goal was to determine the feasibility of an in-vehicle recommender system that
provides coupons for local businesses. The coupons would be targeted to the user in his/her particular
context. Our data were collected on Amazon Mechanical Turk via a survey that we will describe shortly. 
We used Turkers with high ratings (95% or above) and used two random questions with easy answers 
to reject surveys submitted by workers who were not paying attention. Out of 752 surveys, 652 were 
accepted, which generated 12684 data cases (after removing rows containing missing attributes). 

The prediction problem is to predict if a customer is going to accept a coupon for a particular venue, 
considering demographic and contextual attributes. Answers that the user will drive there ‘right away’ or 
‘later before the coupon expires’ are labeled as ‘Y = 1’ and answers ‘no, I do not want the coupon’ are 
labeled as ‘Y=0’. We are interested in investigating 5 types of coupons: bars, takeaway food restaurants, 
coffee houses, cheap restaurants (average expense below $20 per person), and expensive restaurants (average
expense between $20 and $50 per person). In the first part of the survey, we asked users to provide their
demographic information and preferences, and in the second part, we described 20 different driving
scenarios (see an example in the appendix) to each user along with additional context information and coupon
information. We then asked the user if s/he will use the coupon.

For categorical attributes, each attribute-value pair was directly coded into a literal. Using marital
status as an example, 'marital status is single' is converted into (MaritalStatus: Single), (MaritalStatus:
Not Married partner), (MaritalStatus: Not Unmarried partner), and (MaritalStatus: Not Widowed). For
discretized numerical attributes, the levels are ordered, such as: age is '20 to 25', or '26 to 30', etc.; each
attribute-value pair was converted into two literals, each using one side of the range. For example, age is
'20 to 25' was converted into (Age:>=20) and (Age:<=25). Each literal is then a half-space defined by
threshold values. See the appendix for a full description of attributes.
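The two encodings described above can be sketched as follows. The helper names are hypothetical, and the list of marital status values contains only the four that appear in the example; the actual survey may have more levels.

```python
def encode_categorical(attribute, value, all_values):
    """One positive literal for the observed value and one negated literal
    for every other value of the attribute."""
    literals = [f"{attribute}: {value}"]
    literals += [f"{attribute}: Not {other}" for other in all_values if other != value]
    return literals

def encode_range(attribute, low, high):
    """Two half-space literals, one for each side of an ordered range."""
    return [f"{attribute}:>={low}", f"{attribute}:<={high}"]

marital_values = ["Single", "Married partner", "Unmarried partner", "Widowed"]
print(encode_categorical("MaritalStatus", "Single", marital_values))
print(encode_range("Age", 20, 25))
```

Encoding each side of a range separately lets a pattern use a one-sided condition such as (Age:>=20) on its own, which would not be expressible if the range were kept as a single literal.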
We will show that BOA does not lose too much accuracy on the mobile advertisement data sets (with
respect to the highly complicated black box machine learning methods) even though we restricted the
lengths of patterns and the number of patterns to yield very sparse models. We compared with other
classification algorithms C4.5, CART, random forest, linear lasso, linear ridge, logistic lasso, logistic
ridge, and SVM, which span the space of widely used methods that are known for interpretability and/or
accuracy. The decision tree methods are representatives of the class of greedy and heuristic methods
(e.g., [10,11,13,16,17,26-28,31,37,38,51]) that yield interpretable models (though in many cases decision
trees are too large to be interpretable). For all experiments, we measured out-of-sample performance
using AUC (the Area Under the ROC Curve) from 5-fold testing where the MAP BOA from the training
data was used to predict on each test fold.^1 We used the RWeka package in R for the implementations of
the competing methods and tuned the hyperparameters using grid search in nested cross validation. For
the pattern mining step for BOA-BetaBinomial models, we set the minimum support to be 5% and set
the maximum length of patterns to be 3. We used information gain to select the best 5000 patterns to
use for BOA. We ran simulated annealing for 50000 iterations to obtain a pattern set. In Table 2, BOA1
represents BOA-BetaBinomial and BOA2 represents BOA-Poisson.
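For reference, AUC can be computed directly from scores and labels as the probability that a randomly chosen positive is ranked above a randomly chosen negative (ties count one half). The sketch below uses that pairwise definition and then summarizes per-fold values; the function name, the toy scores and the fold values are made up for illustration, and whether the reported standard deviations are sample or population values is not stated in the text (the sample version is used here).

```python
import statistics

def auc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four test points.
print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))

# Hypothetical per-fold AUCs summarized as "mean (standard deviation)".
fold_aucs = [0.74, 0.76, 0.73, 0.77, 0.75]
print(f"{statistics.mean(fold_aucs):.3f} ({statistics.stdev(fold_aucs):.3f})")
```

For a classifier with binary outputs, such as a single pattern set, this definition reduces to (TPR + 1 - FPR) / 2.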



Method        Bar             Takeaway Food   Coffee House    Cheap Restaurant   Expensive Restaurant
BOA1          0.744 (0.021)   0.672 (0.005)   0.753 (0.010)   0.736 (0.022)      0.705 (0.025)
BOA2          0.756 (0.009)   0.637 (0.023)   0.756 (0.007)   0.736 (0.019)      0.707 (0.030)
C4.5          0.757 (0.015)   0.602 (0.051)   0.751 (0.018)   0.692 (0.033)      0.639 (0.027)
CART          0.772 (0.019)   0.615 (0.035)   0.758 (0.013)   0.732 (0.018)      0.657 (0.010)
RF            0.798 (0.016)   0.640 (0.036)   0.815 (0.010)   0.700 (0.022)      0.689 (0.010)
Lin-Lasso     0.795 (0.014)   0.673 (0.042)   0.786 (0.011)   0.769 (0.024)      0.706 (0.017)
Lin-Ridge     0.795 (0.018)   0.671 (0.043)   0.784 (0.012)   0.769 (0.020)      0.706 (0.020)
Logi-Lasso    0.796 (0.014)   0.673 (0.042)   0.787 (0.011)   0.767 (0.024)      0.706 (0.016)
Logi-Ridge    0.793 (0.018)   0.670 (0.042)   0.783 (0.011)   0.768 (0.021)      0.705 (0.020)
SVM           0.842 (0.018)   0.735 (0.031)   0.845 (0.007)   0.799 (0.022)      0.736 (0.022)

Table 2: AUC comparison for the mobile advertisement data set; means and standard deviations over
folds are reported.


We considered five separate coupon prediction problems, for different types of coupons. The AUCs
for BOA and baseline methods for all five problems are reported in Table 2. The BOA classifier, while
restricted to produce sparse disjunctions of conjunctions, tends to perform almost as well as the black box
machine learning methods, and outperforms the decision tree algorithms on three of the five problems.
The linear modeling methods were cross-validated, so even when they are not restricted to be sparse, the
sparse BOA models perform comparably.

^1 We do not perform hypothesis tests, as it is now known that they are not valid due to reuse of data over folds.

Figure 4: ROC for dataset of coupons for bars and coffee houses. Panel (b): coupons for coffee houses.


7.2 Interpretability of results 

In practice, for this particular application, the benefits of interpretability far outweigh small improvements
in accuracy. An interpretable model can be useful to a vendor choosing whether to provide a coupon
and what type of coupon to provide, it can be useful to users of the recommender system, and it can be
useful to the designers of the recommender system to understand the population of users and correlations
with successful use of the system. As discussed earlier, or's of and's classifiers are particularly useful for
representing consumer behavior.

We show several classifiers produced by BOA in Figure 4. We varied the hyperparameters $\alpha_+, \beta_+, \alpha_-, \beta_-$
to obtain different sets of patterns, and plotted corresponding points on the curve. Example pattern sets
are listed in each box along the curve. For instance, the classifier near the middle of the curve in Figure 4
(a) has one pattern, and reads "If a person visits a bar at least once per month, is not traveling with kids,
and their occupation is not farming/fishing/forestry, then predict the person will use the coupon for a bar
before it expires." In these examples (and generally), we see that a user's general interest in a coupon's
venue (bar, coffee shop, etc.) is the most relevant attribute to the classification outcome; it appears in every
pattern in the two figures.







7.3 Experiments with UCI data sets 

We tested BOA on several datasets from the UCI machine learning repository [4], along with baseline
algorithms, and Table 3 displays the results. We observed that BOA achieves the best performance on
all of the data sets we used except Connect4, where random forest is slightly ahead. This is not a surprise:
most of these data sets have an underlying true pattern set that greedy methods would have difficulty
recovering. For example in the tic-tac-toe data set, the positive class can be classified using exactly 8
conditions. BOA has the capability to exactly learn these conditions, whereas the greedy splitting and
pruning methods that are pervasive throughout the data mining literature (e.g., CART, C4.5) and
convexified approximate methods (e.g., SVM) have substantial difficulty with this. We also added 30% noise
to the tic-tac-toe training set and BOA was still able to achieve perfect performance while performance of
other methods suffered. Both linear models and tree models exist that achieve perfect accuracy, but the
heuristic splitting/pruning and convexification of the methods we compared with prevented these perfect
solutions from being found.



Method        Monk 1          Mushroom        Breast Cancer   Connect4        Tic-tac-toe     Tic-tac-toe (30% noise)
BOA1          1.000 (0.000)   1.000 (0.000)   0.990 (0.003)   0.926 (0.002)   1.000 (0.000)   1.000 (0.000)
BOA2          1.000 (0.000)   1.000 (0.000)   0.996 (0.007)   0.902 (0.002)   1.000 (0.000)   1.000 (0.000)
C4.5          0.906 (0.067)   1.000 (0.000)   0.873 (0.017)   0.867 (0.002)   0.949 (0.016)   0.942 (0.022)
CART          0.826 (0.061)   1.000 (0.000)   0.978 (0.010)   0.703 (0.003)   0.966 (0.011)   0.962 (0.014)
RF            1.000 (0.000)   1.000 (0.000)   0.970 (0.016)   0.940 (0.002)   0.991 (0.003)   0.989 (0.006)
Lin-Lasso     0.556 (0.061)   0.995 (0.002)   0.985 (0.005)   0.858 (0.002)   0.986 (0.002)   0.854 (0.019)
Lin-Ridge     0.560 (0.078)   0.999 (0.000)   0.987 (0.003)   0.857 (0.002)   0.931 (0.017)   0.820 (0.033)
Logi-Lasso    0.666 (0.084)   0.989 (0.002)   0.988 (0.003)   0.859 (0.002)   0.988 (0.002)   0.860 (0.029)
Logi-Ridge    0.686 (0.103)   0.999 (0.000)   0.988 (0.003)   0.857 (0.002)   0.869 (0.025)   0.805 (0.032)
SVM           0.957 (0.034)   0.999 (0.000)   0.986 (0.005)   0.924 (0.002)   0.993 (0.001)   0.992 (0.002)

Table 3: AUC comparison for some UCI data sets


As an example, we illustrate BOA's output on the breast cancer dataset. BOA took exactly 2 minutes
on a laptop to generate the following pattern set:

if X satisfies (Marginal Adhesion > 3 AND Uniformity of Cell Shape > 3)
    OR (Clump Thickness > 7)
    OR (Bland Chromatin > 4 AND Uniformity of Cell Size > 1 AND Clump Thickness > 2) then
    Predict the tumor is malignant
else
    Predict the tumor is benign.
end if

The out-of-sample accuracy of this model was 0.952, with true positive rate 0.974, and false positive
rate 0.060. Or's of and's models could potentially be useful for medical applications, since they could
characterize simple sets of conditions that would place a patient in a high risk category. This may be more
useful in some cases than the typical scoring systems (linear models) used in medical calculators.
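The pattern set above translates directly into code. The sketch below implements it as a function, using the thresholds exactly as printed, together with a helper that computes true and false positive rates; the dictionary keys, the helper name and the two sample records are hypothetical and are not taken from the dataset.

```python
def predict_malignant(x):
    """Or's of and's rule for the breast cancer example (thresholds as printed)."""
    return bool(
        (x["Marginal Adhesion"] > 3 and x["Uniformity of Cell Shape"] > 3)
        or x["Clump Thickness"] > 7
        or (x["Bland Chromatin"] > 4
            and x["Uniformity of Cell Size"] > 1
            and x["Clump Thickness"] > 2)
    )

def rates(predictions, labels):
    """Return (true positive rate, false positive rate) for 0/1 labels."""
    tp = sum(1 for p, y in zip(predictions, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p and y == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, fp / n_neg

# Two made-up records for illustration.
high_risk = {"Marginal Adhesion": 5, "Uniformity of Cell Shape": 6,
             "Clump Thickness": 4, "Bland Chromatin": 3,
             "Uniformity of Cell Size": 4}
low_risk = {"Marginal Adhesion": 1, "Uniformity of Cell Shape": 1,
            "Clump Thickness": 1, "Bland Chromatin": 1,
            "Uniformity of Cell Size": 1}
print(predict_malignant(high_risk), predict_malignant(low_risk))
```

Because each clause is a short conjunction of threshold tests, a clinician can check by hand which clause, if any, fired for a given patient.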

8 Conclusion 

We presented a method that produces or's of and's models, where the shape of the model can be controlled
by the user through Bayesian priors. In some applications, such as those arising in customer behavior
modeling, the form of these models may be more useful than traditional linear models. Since finding
sparse models is computationally hard, most approaches take severe heuristic approximations (such as
greedy splitting and pruning in the case of decision trees, or convexification in the case of linear models).
These approximations can severely hurt performance, as is easily shown experimentally, using datasets
whose ground truth formulas are not difficult to find. We chose a different type of approximation, where
we make an up-front statistical assumption in building our models out of pre-mined rules, and find the
globally optimal solution in the reduced space of rules. We then find theoretical conditions under which
using pre-mined rules provably does not change the set of MAP optimal solutions. These conditions relate
the size of the dataset to the strength of the prior. If the prior is sufficiently strong and the dataset is not
too large, the set of pre-mined rules is provably sufficient. We showed the benefits of this approach on a
consumer behavior modeling application of current interest to "connected vehicle" projects. Our results,
using data from an extensive survey taken by several hundred individuals, show that simple patterns based
on a user's context can be directly useful in predicting the user's response.

A Proofs


We start with the following lemma that we will need later. 

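Lemma 1. Define a function $g(l; \lambda, J) = \left(\frac{\lambda}{2}\right)^{l} \Gamma(J - l + 1)$, $\lambda, J \in \mathbb{N}^+$. If $1 \le l \le J$, then $g(l; \lambda, J) \le
g_{\max}(\lambda, J)$, where $g_{\max}(\lambda, J) = \max\left\{ \frac{\lambda}{2}\Gamma(J),\ \left(\frac{\lambda}{2}\right)^{J} \right\}$.

Proof. (Of Lemma 1) In order to bound $g(l; \lambda, J)$, we will show that $g(l; \lambda, J)$ is convex, which means
its maximum value occurs at the endpoints of the interval we are considering. The second derivative of
$g(l; \lambda, J)$ with respect to $l$ is

$$g''(l; \lambda, J) = g(l; \lambda, J) \left[ \left( \ln\frac{\lambda}{2} + \gamma - \sum_{k=1}^{\infty} \left( \frac{1}{k} - \frac{1}{k + J - l} \right) \right)^{2} + \sum_{k=1}^{\infty} \frac{1}{(k + J - l)^{2}} \right] > 0,$$

where $\gamma$ is the Euler-Mascheroni constant, since at least one of the terms is greater than 0. Thus
$g(l; \lambda, J)$ is strictly convex. Therefore the maximum of
$g(l; \lambda, J)$ is achieved at the boundary of $L_m$, namely 1 or $J$. So we have

$$g(l; \lambda, J) \le \max\{ g(1; \lambda, J),\ g(J; \lambda, J) \} = \max\left\{ \frac{\lambda}{2}\Gamma(J),\ \left(\frac{\lambda}{2}\right)^{J} \right\} = g_{\max}(\lambda, J). \qquad (12)$$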
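The endpoint bound of Lemma 1 is easy to sanity-check numerically. The sketch below assumes $g(l;\lambda,J) = (\lambda/2)^l\,\Gamma(J-l+1)$ and $g_{\max}(\lambda,J) = \max\{(\lambda/2)\Gamma(J), (\lambda/2)^J\}$ and verifies $g \le g_{\max}$ on a grid of integer values; the grid limits are arbitrary choices, and the check illustrates the lemma rather than proving it.

```python
import math

def g(l, lam, J):
    """g(l; lambda, J) = (lambda / 2)^l * Gamma(J - l + 1)."""
    return (lam / 2.0) ** l * math.gamma(J - l + 1)

def g_max(lam, J):
    """Larger of the two endpoint values g(1) and g(J)."""
    return max((lam / 2.0) * math.gamma(J), (lam / 2.0) ** J)

violations = [
    (l, lam, J)
    for lam in range(1, 30)
    for J in range(1, 15)
    for l in range(1, J + 1)
    if g(l, lam, J) > g_max(lam, J) * (1 + 1e-12)  # tolerance for rounding
]
print(len(violations))
```

An empty list of violations is what convexity of $g$ in $l$ predicts, since a convex function on an interval attains its maximum at an endpoint.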


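Proof. (Of Theorem 1)

Let $\emptyset$ denote an empty set where there are no patterns and $A^*$ is the MAP pattern set. Since $A^* \in
\arg\min_A E_S(A)$, we have

$$P(S, A^*; \theta) \ge P(S, \emptyset; \theta), \qquad (13)$$

and the joint probabilities can be written as

$$P(S, A^*; \theta) = P(A^*; \theta) P(S|A^*; \theta), \qquad (14)$$

$$P(S, \emptyset; \theta) = P(\emptyset; \theta) P(S|\emptyset; \theta). \qquad (15)$$

Now we bound priors and likelihoods for these two joint probabilities.

Step 1: Upper Bound for $P(S, A^*; \theta)$. We first look at the prior and likelihood for the MAP pattern set $A^*$.
The lengths of patterns in $A^*$ are denoted as $L_m$, $m \in \{1, \dots, M^*\}$, so the prior probability of selecting
$A^*$ is

$$P(A^*; \theta) = \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(M^*; \lambda_M) \prod_{m=1}^{M^*} \mathrm{Poisson}(L_m; \lambda_L) \frac{1}{\binom{J}{L_m}} \prod_{k=1}^{L_m} \frac{1}{K_{v_{m,k}}}$$

$$= \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(M^*; \lambda_M) \prod_{m=1}^{M^*} \frac{e^{-\lambda_L} \lambda_L^{L_m}\, \Gamma(J - L_m + 1)}{\Gamma(J+1)} \prod_{k=1}^{L_m} \frac{1}{K_{v_{m,k}}}$$

$$\le \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(M^*; \lambda_M) \prod_{m=1}^{M^*} \frac{e^{-\lambda_L}}{\Gamma(J+1)} \left(\frac{\lambda_L}{2}\right)^{L_m} \Gamma(J - L_m + 1), \qquad (16)$$

where $v_{m,k}$ indexes the attribute used by the $k$-th literal of the $m$-th pattern. (16) follows from $K_{v_{m,k}} \ge 2$, since
all attributes have at least two values. Using Lemma 1 we have that

$$g(L_m; \lambda_L, J) = \left(\frac{\lambda_L}{2}\right)^{L_m} \Gamma(J - L_m + 1) \le g_{\max}(\lambda_L, J). \qquad (17)$$

Combining (16) and (17) we have

$$P(A^*; \theta) \le \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(M^*; \lambda_M) \left( \frac{e^{-\lambda_L} g_{\max}(\lambda_L, J)}{\Gamma(J+1)} \right)^{M^*}. \qquad (18)$$

The maximum likelihood of data is achieved when all points in the training set are classified correctly, and
the likelihood is at most 1,

$$P(S|A^*; \theta) \le 1. \qquad (19)$$

Combining (18) and (19), the joint probability of $S$ and $A^*$ obeys

$$P(S, A^*; \theta) = P(A^*; \theta) P(S|A^*; \theta) \le \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(M^*; \lambda_M) \left( \frac{e^{-\lambda_L} g_{\max}(\lambda_L, J)}{\Gamma(J+1)} \right)^{M^*}. \qquad (20)$$

Step 2: Lower Bound for $P(S, \emptyset; \theta)$. Next we bound the prior and likelihood for an empty set. The prior
probability for selecting an empty set is

$$P(\emptyset; \theta) = \omega(\lambda_M, \lambda_L)\, \mathrm{Poisson}(0; \lambda_M) = \omega(\lambda_M, \lambda_L) e^{-\lambda_M}. \qquad (21)$$

Therefore the joint probability of $S$ and $\emptyset$ is:

$$P(S, \emptyset; \theta) = P(\emptyset; \theta) P(S|\emptyset; \theta) = \omega(\lambda_M, \lambda_L) e^{-\lambda_M} P(S|\emptyset; \theta). \qquad (22)$$

Combine Steps 1 and 2. Now we use the inequality in (13) and substitute individual terms with (20) and (22)
to get

$$\omega(\lambda_M, \lambda_L) \frac{e^{-\lambda_M} \lambda_M^{M^*}}{M^*!} \left( \frac{e^{-\lambda_L} g_{\max}(\lambda_L, J)}{\Gamma(J+1)} \right)^{M^*} \ge \omega(\lambda_M, \lambda_L) e^{-\lambda_M} P(S|\emptyset; \theta),$$

$$\frac{\lambda_M^{M^*}}{M^*!} \left( \frac{e^{-\lambda_L} g_{\max}(\lambda_L, J)}{\Gamma(J+1)} \right)^{M^*} \ge P(S|\emptyset; \theta). \qquad (23)$$

For simplicity we define $x = \frac{e^{-\lambda_L} g_{\max}(\lambda_L, J) \lambda_M}{\Gamma(J+1)}$. If $M^* \le \lambda_M$, the statement of the theorem holds trivially.
For the remainder of the proof we consider the case $M^* > \lambda_M$. The left side of (23) becomes $\frac{x^{M^*}}{M^*!}$ and can be
upper bounded by

$$\frac{x^{M^*}}{M^*!} \le \frac{x^{M^*}}{\lambda_M!\, (\lambda_M + 1)^{(M^* - \lambda_M)}},$$

where in the denominator we used $M^*! = \lambda_M! (\lambda_M + 1) \cdots M^* \ge \lambda_M! (\lambda_M + 1)^{(M^* - \lambda_M)}$. So we have

$$\frac{x^{\lambda_M}}{\lambda_M!} \left( \frac{x}{\lambda_M + 1} \right)^{M^* - \lambda_M} \ge P(S|\emptyset; \theta). \qquad (24)$$

By the theorem's assumption, we have $\frac{e^{-\lambda_L}(\lambda_L/2)^J}{\Gamma(J+1)} \le 1$, so $\frac{e^{-\lambda_L}(\lambda_L/2)^J \lambda_M}{\Gamma(J+1)(\lambda_M+1)} < 1$. We also have
$\frac{e^{-\lambda_L}(\lambda_L/2)\Gamma(J)\lambda_M}{\Gamma(J+1)(\lambda_M+1)} < 1$. To see this, note $\frac{\lambda_M}{\lambda_M+1} < 1$, $\frac{\Gamma(J)}{\Gamma(J+1)} \le 1$ and $e^{-\lambda_L}\frac{\lambda_L}{2} \le 1$ for every $\lambda_L$. Therefore
$\frac{x}{\lambda_M + 1} < 1$. Solving for $M^*$ in (24), using $\frac{x}{\lambda_M+1} < 1$ to determine the direction of the inequality,
yields:

$$M^* \le \lambda_M + \frac{\log\left( \dfrac{P(S|\emptyset; \theta)\, \lambda_M!}{x^{\lambda_M}} \right)}{\log\left( \dfrac{x}{\lambda_M + 1} \right)}. \qquad (25)$$

Now we compute the likelihood for the empty set BOA model. The empty set classifies all data points as
negative, so TP = FP = 0, FN = $|S^+|$, TN = $|S^-|$. The likelihood is

$$P(S|\emptyset; \theta) = \frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \frac{\Gamma(|S^-| + \alpha_-)\, \Gamma(|S^+| + \beta_-)}{\Gamma(|S| + \alpha_- + \beta_-)}. \qquad (26)$$

Combining (25) with (26) and also substituting $x$ with its definition
$x = \frac{e^{-\lambda_L} \lambda_M \max\{(\lambda_L/2)\Gamma(J),\ (\lambda_L/2)^J\}}{\Gamma(J+1)}$, we have

$$M^* \le \lambda_M + \frac{\log\left( \dfrac{\Gamma(\alpha_-+\beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \dfrac{\Gamma(|S^-|+\alpha_-)\,\Gamma(|S^+|+\beta_-)}{\Gamma(|S|+\alpha_-+\beta_-)} \cdot \dfrac{\lambda_M!}{\left( \frac{e^{-\lambda_L}\lambda_M \max\{(\lambda_L/2)\Gamma(J),\ (\lambda_L/2)^J\}}{\Gamma(J+1)} \right)^{\lambda_M}} \right)}{\log\left( \dfrac{e^{-\lambda_L}\lambda_M \max\{(\lambda_L/2)\Gamma(J),\ (\lambda_L/2)^J\}}{\Gamma(J+1)(\lambda_M+1)} \right)}. \qquad (27)$$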


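Proof. (Of Theorem 2) Since the BOA-BetaBinomial and BOA-Poisson have the same likelihood model,
the inequality (13) still holds and we can reuse the likelihood expressions (19) and (26).

For an empty model, the likelihood is in (26). The BOA-BetaBinomial prior for an empty set is

$$P(\emptyset; \theta) = \prod_{l=1}^{L} \frac{\Gamma(\alpha_l + \beta_l)}{\Gamma(\beta_l)} \frac{\Gamma(|\mathcal{A}_S^{[l]}| + \beta_l)}{\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)}. \qquad (28)$$

For the MAP solution $A^*$, we define that there are $M_l^*$ patterns selected from $\mathcal{A}_S^{[l]}$ for $l \in \{1, \dots, L\}$, and
$M^* = \sum_{l} M_l^*$. The prior probability of selecting $A^*$ is

$$P(A^*; \theta) = \prod_{l=1}^{L} \frac{\Gamma(\alpha_l + \beta_l)}{\Gamma(\alpha_l)\Gamma(\beta_l)} \frac{\Gamma(M_l^* + \alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| - M_l^* + \beta_l)}{\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)} = \prod_{l=1}^{L} \frac{\Gamma(\alpha_l + \beta_l)\, g_2(M_l^*)}{\Gamma(\alpha_l)\Gamma(\beta_l)\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)}, \qquad (29)$$

where we define

$$g_2(M_l^*) = \Gamma(M_l^* + \alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l - M_l^*). \qquad (30)$$

Now we apply the same inequality $P(A^*; \theta) P(S|A^*; \theta) \ge P(\emptyset; \theta) P(S|\emptyset; \theta)$ used for the proof for
Theorem 1, and substituting in (19), (26), (28) and (29) on individual terms, we find

$$\prod_{l=1}^{L} \frac{\Gamma(\alpha_l + \beta_l)\, g_2(M_l^*)}{\Gamma(\alpha_l)\Gamma(\beta_l)\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)} \ge P(S|\emptyset; \theta) \cdot \prod_{l=1}^{L} \frac{\Gamma(\alpha_l + \beta_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l)}{\Gamma(\beta_l)\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)},$$

$$\prod_{l=1}^{L} g_2(M_l^*) \ge P(S|\emptyset; \theta) \cdot \prod_{l=1}^{L} \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l)\, \Gamma(\alpha_l). \qquad (31)$$

We want to bound $g_2(M_{l'}^*)$ for each $l'$. To do this, we use

$$\prod_{l} g_2(M_l^*) = g_2(M_{l'}^*) \prod_{l \ne l'} g_2(M_l^*), \qquad (32)$$

where we will trivially upper bound $\prod_{l \ne l'} g_2(M_l^*)$ to leave only the $g_2(M_{l'}^*)$ term. In particular, we
observe $g_2(M_l^*) \le g_2(0)$ for all $M_l^*$. To obtain this, we take the derivative of $g_2(M_l^*)$ with respect to $M_l^*$
as follows:

$$g_2'(M_l^*) = g_2(M_l^*) \sum_{k=1}^{\infty} \left( \frac{1}{k + |\mathcal{A}_S^{[l]}| + \beta_l - M_l^* - 1} - \frac{1}{k + M_l^* + \alpha_l - 1} \right),$$

since by the assumption of the theorem $\alpha_l < \beta_l$, thus $\alpha_l < |\mathcal{A}_S^{[l]}| + \beta_l$, and each $M_l^* \le |\mathcal{A}_S^{[l]}|$, which means
each term in the summation is less than 0. Thus $g_2'(M_l^*) < 0$, hence $g_2(M_l^*) \le g_2(0)$. Therefore we can
derive an inequality for each $l'$:

$$\prod_{l} g_2(M_l^*) \le g_2(M_{l'}^*) \prod_{l = 1, \dots, L,\ l \ne l'} g_2(0) = g_2(M_{l'}^*) \prod_{l \ne l'} \Gamma(\alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l) \quad \text{for } l' \in \{1, \dots, L\}. \qquad (33)$$

Combining (31) with (33) we have

$$g_2(M_l^*) \ge P(S|\emptyset; \theta)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l)\, \Gamma(\alpha_l) \quad \text{for } l \in \{1, \dots, L\}. \qquad (34)$$

We write out $g_2(M_l^*)$ in (30) as the following, multiplying by 1 in disguise:

$$g_2(M_l^*) = \Gamma(\alpha_l)\, \alpha_l \cdots (\alpha_l + M_l^* - 1) \cdot \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l - M_l^*) \frac{(|\mathcal{A}_S^{[l]}| + \beta_l - M_l^*) \cdots (|\mathcal{A}_S^{[l]}| + \beta_l - 1)}{(|\mathcal{A}_S^{[l]}| + \beta_l - M_l^*) \cdots (|\mathcal{A}_S^{[l]}| + \beta_l - 1)}$$

$$= \Gamma(\alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l) \frac{\alpha_l \cdots (\alpha_l + M_l^* - 1)}{(|\mathcal{A}_S^{[l]}| + \beta_l - M_l^*) \cdots (|\mathcal{A}_S^{[l]}| + \beta_l - 1)}$$

$$\le \Gamma(\alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l) \left( \frac{\alpha_l + M_l^* - 1}{|\mathcal{A}_S^{[l]}| + \beta_l - 1} \right)^{M_l^*}$$

$$\le \Gamma(\alpha_l)\, \Gamma(|\mathcal{A}_S^{[l]}| + \beta_l) \left( \frac{|\mathcal{A}_S^{[l]}| + \alpha_l - 1}{|\mathcal{A}_S^{[l]}| + \beta_l - 1} \right)^{M_l^*}. \qquad (35)$$

Here, (35) follows because $M_l^* \le |\mathcal{A}_S^{[l]}|$. Combining with (34), we have

$$P(S|\emptyset; \theta) \le \left( \frac{|\mathcal{A}_S^{[l]}| + \alpha_l - 1}{|\mathcal{A}_S^{[l]}| + \beta_l - 1} \right)^{M_l^*}.$$

Substituting in equation (26) and using $\alpha_l < \beta_l$, we get

$$M_l^* \le \frac{\log\left( \dfrac{\Gamma(\alpha_-+\beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \dfrac{\Gamma(|S^-|+\alpha_-)\,\Gamma(|S^+|+\beta_-)}{\Gamma(|S|+\alpha_-+\beta_-)} \right)}{\log\left( \dfrac{|\mathcal{A}_S^{[l]}| + \alpha_l - 1}{|\mathcal{A}_S^{[l]}| + \beta_l - 1} \right)}, \qquad (36)$$

which holds for $l \in \{1, \dots, L\}$. Thus

$$M^* = \sum_{l=1}^{L} M_l^* \le \sum_{l=1}^{L} \frac{\log\left( \dfrac{\Gamma(\alpha_-+\beta_-)}{\Gamma(\alpha_-)\Gamma(\beta_-)} \dfrac{\Gamma(|S^-|+\alpha_-)\,\Gamma(|S^+|+\beta_-)}{\Gamma(|S|+\alpha_-+\beta_-)} \right)}{\log\left( \dfrac{|\mathcal{A}_S^{[l]}| + \alpha_l - 1}{|\mathcal{A}_S^{[l]}| + \beta_l - 1} \right)}. \qquad (37)$$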


Proof. (Of Theorem 3) For a pattern set A, we will show that if any pattern Oz has support supps{az) < C 
on data S, then A 0 arg mini75(yl'). Assume pattern Oz has support less than C and define which 

A'&ks 

has the z-th pattern removed from A: 


-^\z 

Assume A consists of M rules, among which Mi come from pool of rules with length I, I G {1, ...L}, 

and the z-th rule has length I' so it is removed from .4^ . Define TP, FP, TN and FN to be the number 
of true positives, false positives, true negatives and false negatives in S given A. We now compute the 
likelihood for model A\^. The most extreme case is when pattern Oz is an accurate rule that applies only 
to real positive data points and those data points satisfy only a^. Therefore once removing it, the number 
of true positives decreases by suppg(az) and the number of false negatives increases by suppg(az)- 
Step 1: Relate P(5|gI\^; 9) to P{S\A-, 9). 

p(q\A px ^ r(a+ + (3+) T(TP + a+- supps{az))T{FP + /3+) 

^ >-T{a+)T(l5+)T{TP + FP + a+ + l5+-supps{az)) 

r(a_ + /)_) T(TN + a-)T{FN + (3- + supps{az)) 
r(a_)r(/)_) T(TN + FN + a_ + /)_ + supps{az)) 

=P{S\A-9)-gii{TP,FP,TN,FN-a+,l3+,a.,p.,supps{az)), ( 38 ) 
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where 


g 3 {TP,FP, TN,FN; a+, /3+,a-, (3-,supps{az)) = 

T{TP + a+ - supps{az)) T{TP + FP + a_|_ + /?+) 

T{TP + a+) T{TP + FF + a+ + /?+ - supps{az)) 

T{FN+ +supps{az)) T{TN + FN +a- +/3_) 

r(FA/'+ /?_) r(rA/' + fa/"+«_+/?_ + supps{az))' 

Now we break down g 3 {TP,FP, TN, FN] a+, f3+,a-, l3-,suppg{az)) to find a lower bound for it. The first 
two terms in (39) become 


T(TP + a+ - suppsittz)) 


r(FF + FF + chj^ + /3-|_) 


T{TP + a+) T{TP + FF + a+ + /3+ - supps{az)) 

(TP + FF + + /3-|_ — suppg(az)) ■ ■ ■ (TP + FF + Q!_|_ + /3_|_ — 1) 

(FF + a+ - supps(az))... (TP + a+- 1) 

FF + FF + a+ + /3+ - 1 ^ 


> 


TP “h — 1 


V I'S'+I + a+ — 1 / 


(40) 


Equality holds in (40) when TP = |5''*“|,FF = 0. The last two terms in (39) become 


T(FN + (3- + supps(az)) 


T(TN + FN + a- + j3-) 


r(FA^ + (3—) r(FA^ + FN + cx— + (3— + suppg(az)) 

_ (FN +/3-).. .(FN +(3-+ supps(az) - 1) 


> 


> 


(TN + FN + a-+ (3-) ...(TN + FN+a- +(3-+ supps(az) 
FN + f3- \ 


- 1 ) 


FN+TN+a- +/3-^ 

^ \ supps(az) 

I S~ I + CX— + (3— 


(41) 


Equality in (41) holds when TN = [F |, EN = 0. Combining (39), (40) and (41), we obtain a lower bound 
for g 3 (TP, EP, TN, EN; a+, (3+,a-,/3-,supps(az)) as 


g3(TP,EP, TN, EN, a+,/3+,a-, l3-,supps(az)) > 


/ + /?+ — 1 /3— 

\ + 0 + ~ 1 liS* I + CX— + f3— 


suppg{az) 


Eollowing (38), 


p{s\A\,,e)> 


/ 1I + Q!-|_ + /3-|_ — 1 (3- 

+ Ck+ ~ 1 I *5” I + CX— + (3- 


suppfiaz) 

■P(S\A;e). 


(42) 
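Step 2: Relate $P(A_{\setminus z}; \theta)$ to $P(A; \theta)$. Since $A_{\setminus z}$ consists of the same rules as $A$ except for one missing rule from $\mathcal{A}_S^{[l']}$, we multiply $P(A_{\setminus z}; \theta)$ by 1 in disguise to relate it to $P(A; \theta)$. Writing out the prior,

$$
P(A_{\setminus z}; \theta) = \frac{\Gamma(\alpha_{l'} + \beta_{l'})}{\Gamma(\alpha_{l'}) \Gamma(\beta_{l'})} \cdot \frac{\Gamma(M_{l'} - 1 + \alpha_{l'}) \, \Gamma(|\mathcal{A}_S^{[l']}| - M_{l'} + 1 + \beta_{l'})}{\Gamma(|\mathcal{A}_S^{[l']}| + \alpha_{l'} + \beta_{l'})} \prod_{l \ne l'} \frac{\Gamma(\alpha_l + \beta_l)}{\Gamma(\alpha_l) \Gamma(\beta_l)} \cdot \frac{\Gamma(M_l + \alpha_l) \, \Gamma(|\mathcal{A}_S^{[l]}| - M_l + \beta_l)}{\Gamma(|\mathcal{A}_S^{[l]}| + \alpha_l + \beta_l)}.
$$

Dividing and multiplying by $P(A; \theta)$, every factor with $l \ne l'$ cancels, as do the normalizing terms for $l'$, which leaves

$$
\begin{aligned}
P(A_{\setminus z}; \theta)
&= \frac{\Gamma(M_{l'} - 1 + \alpha_{l'}) \, \Gamma(|\mathcal{A}_S^{[l']}| - M_{l'} + 1 + \beta_{l'})}{\Gamma(M_{l'} + \alpha_{l'}) \, \Gamma(|\mathcal{A}_S^{[l']}| - M_{l'} + \beta_{l'})} \cdot P(A; \theta) \\
&= \frac{|\mathcal{A}_S^{[l']}| - M_{l'} + \beta_{l'}}{M_{l'} - 1 + \alpha_{l'}} \cdot P(A; \theta).
\end{aligned}
$$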


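The ratio $\frac{|\mathcal{A}_S^{[l']}| - M_{l'} + \beta_{l'}}{M_{l'} - 1 + \alpha_{l'}}$ decreases monotonically as $M_{l'}$ increases, therefore it is lower bounded at the maximum of $M_{l'}$. We use $m_{l'}$ to denote the upper bound for $M_{l'}$ from (36), i.e.,

$$
m_{l'} = \frac{\log \left[ \frac{\Gamma(\alpha_- + \beta_-)}{\Gamma(\alpha_-) \Gamma(\beta_-)} \cdot \frac{\Gamma(|S^-| + \alpha_-) \, \Gamma(|S^+| + \beta_-)}{\Gamma(|S| + \alpha_- + \beta_-)} \right]}{\log \frac{|\mathcal{A}_S^{[l']}| + \alpha_{l'} - 1}{|\mathcal{A}_S^{[l']}| + \beta_{l'} - 1}},
$$

so

$$
\frac{|\mathcal{A}_S^{[l']}| - M_{l'} + \beta_{l'}}{M_{l'} - 1 + \alpha_{l'}} \ge \frac{|\mathcal{A}_S^{[l']}| - m_{l'} + \beta_{l'}}{m_{l'} - 1 + \alpha_{l'}}.
$$

Therefore

$$
P(A_{\setminus z}; \theta) \ge \min_l \left( \frac{|\mathcal{A}_S^{[l]}| - m_l + \beta_l}{m_l - 1 + \alpha_l} \right) P(A; \theta).
\tag{43}
$$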


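Step 3: Combine Step 1 and Step 2. Combining (42) and (43), the joint probability of $S$ and $A_{\setminus z}$ is bounded by

$$
\begin{aligned}
P(S, A_{\setminus z}; \theta) &= P(A_{\setminus z}; \theta) \, P(S \mid A_{\setminus z}; \theta) \\
&\ge \min_l \left( \frac{|\mathcal{A}_S^{[l]}| - m_l + \beta_l}{m_l - 1 + \alpha_l} \right) \left( \frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} \right)^{\mathrm{supp}_S(a_z)} \cdot P(S, A; \theta).
\end{aligned}
$$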
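In order to get $P(S, A_{\setminus z}; \theta) \ge P(S, A; \theta)$, we need

$$
\min_l \left( \frac{|\mathcal{A}_S^{[l]}| - m_l + \beta_l}{m_l - 1 + \alpha_l} \right) \left( \frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} \right)^{\mathrm{supp}_S(a_z)} \ge 1.
$$

We have $\frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} < 1$ from the assumption in the theorem's statement, thus

$$
\mathrm{supp}_S(a_z) \le \frac{\log \min_l \left( \frac{|\mathcal{A}_S^{[l]}| - m_l + \beta_l}{m_l - 1 + \alpha_l} \right)}{\log \left( \frac{|S^+| + \alpha_+ - 1}{|S^+| + \alpha_+ + \beta_+ - 1} \cdot \frac{|S^-| + \alpha_- + \beta_-}{\beta_-} \right)}.
$$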
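To make the final inequality of this proof concrete, here is a minimal sketch that evaluates its right-hand side. It assumes the per-length bounds $m_l$ are already available from (36) and are passed in directly; all names (`pool_sizes` for $|\mathcal{A}_S^{[l]}|$, `m_upper` for $m_l$, `n_pos` for $|S^+|$, `n_neg` for $|S^-|$) are hypothetical, and the numbers in the usage lines are made up for illustration.

```python
from math import log

def support_threshold_beta_binomial(pool_sizes, m_upper, alpha, beta,
                                    n_pos, n_neg, a_pos, b_pos, a_neg, b_neg):
    """Right-hand side of the support bound derived for the Beta-Binomial prior.

    pool_sizes, m_upper, alpha, beta are dicts keyed by pattern length l.
    """
    # Smallest prior ratio over pattern lengths, as in (43).
    prior_factor = min(
        (pool_sizes[l] - m_upper[l] + beta[l]) / (m_upper[l] - 1 + alpha[l])
        for l in pool_sizes
    )
    # Reciprocal of the likelihood ratio from (42); it must exceed 1.
    inv_ratio = ((n_pos + a_pos - 1) / (n_pos + a_pos + b_pos - 1)
                 * (n_neg + a_neg + b_neg) / b_neg)
    if inv_ratio <= 1:
        raise ValueError("the likelihood ratio must be below 1 for this bound")
    return log(prior_factor) / log(inv_ratio)

# Illustrative values only.
threshold = support_threshold_beta_binomial(
    {1: 100, 2: 1000}, {1: 5, 2: 8}, {1: 1, 2: 1}, {1: 50, 2: 500},
    n_pos=500, n_neg=500, a_pos=100, b_pos=1, a_neg=1, b_neg=1000)
print(round(threshold, 2))
```

At the returned value the condition "prior factor times likelihood ratio to the power of the support equals 1" holds exactly, which is the boundary case of the inequality above.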


Proof. (Of Theorem 4) Similar to the proof for Theorem 3, we will show that for a pattern set $A$, if any pattern $a_z$ has support $\mathrm{supp}_S(a_z) < C$ on data $S$, then $A \notin \arg\max_{A'} P(S, A'; \theta)$. Assume pattern $a_z$ has support less than $C$ and $A_{\setminus z}$ is $A$ with the $z$-th pattern removed. Assume $A$ consists of $M$ rules, and the $z$-th rule has length $L_z$.

Step 1 is the same as in the proof for Theorem 3. We then relate $P(A_{\setminus z}; \theta)$ to $P(A; \theta)$ by multiplying $P(A_{\setminus z}; \theta)$ by 1 in disguise:


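\begin{align}
P(A_{\setminus z}; \theta)
&= \omega(\lambda_M, \lambda_L) \, \mathrm{Poisson}(M - 1; \lambda_M) \prod_{m \ne z} \mathrm{Poisson}(L_m; \lambda_L) \frac{1}{\binom{J}{L_m}} \prod_{k=1}^{L_m} \frac{1}{K_{v_{m,k}}} \notag \\
&= \frac{\omega(\lambda_M, \lambda_L) \, \mathrm{Poisson}(M - 1; \lambda_M)}{\omega(\lambda_M, \lambda_L) \, \mathrm{Poisson}(M; \lambda_M) \, \mathrm{Poisson}(L_z; \lambda_L) \frac{1}{\binom{J}{L_z}} \prod_{k=1}^{L_z} \frac{1}{K_{v_{z,k}}}} \cdot P(A; \theta) \notag \\
&= \frac{M \, \Gamma(J + 1) \prod_{k=1}^{L_z} K_{v_{z,k}}}{\lambda_M e^{-\lambda_L} \lambda_L^{L_z} \, \Gamma(J - L_z + 1)} \cdot P(A; \theta) \notag \\
&\ge \frac{M \, \Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \left( \frac{\lambda_L}{2} \right)^{L_z} \Gamma(J - L_z + 1)} \cdot P(A; \theta) \tag{44} \\
&= \frac{M \, \Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \, g(L_z; \lambda_L, J)} \cdot P(A; \theta) \tag{45} \\
&\ge \frac{\Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \, g_{\max}(\lambda_L, J)} \cdot P(A; \theta), \tag{46}
\end{align}

where (44) follows from $K_{v_{z,k}} \ge 2$, since all attributes have at least two values, (45) follows from the definition of $g(l; \lambda, J)$ in Lemma 1, and (46) uses the upper bound in Lemma 1 and $M \ge 1$.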
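Then combining (42) with (46), the joint probability of $S$ and $A_{\setminus z}$ is lower bounded by

$$
\begin{aligned}
P(S, A_{\setminus z}; \theta) &= P(A_{\setminus z}; \theta) \, P(S \mid A_{\setminus z}; \theta) \\
&\ge \frac{\Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \, g_{\max}(\lambda_L, J)} \left( \frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} \right)^{\mathrm{supp}_S(a_z)} \cdot P(S, A; \theta).
\end{aligned}
$$

In order to get $P(S, A_{\setminus z}; \theta) \ge P(S, A; \theta)$, we need

$$
\frac{\Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \, g_{\max}(\lambda_L, J)} \left( \frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} \right)^{\mathrm{supp}_S(a_z)} \ge 1.
$$

We have $\frac{|S^+| + \alpha_+ + \beta_+ - 1}{|S^+| + \alpha_+ - 1} \cdot \frac{\beta_-}{|S^-| + \alpha_- + \beta_-} \le 1$ and $g_{\max}(\lambda_L, J) = \max \left\{ \frac{\lambda_L}{2} \Gamma(J), \left( \frac{\lambda_L}{2} \right)^J \right\}$, thus

$$
\mathrm{supp}_S(a_z) \le \frac{\log \frac{\Gamma(J + 1)}{\lambda_M e^{-\lambda_L} \max \left\{ \frac{\lambda_L}{2} \Gamma(J), \left( \frac{\lambda_L}{2} \right)^J \right\}}}{\log \left( \frac{|S^+| + \alpha_+ - 1}{|S^+| + \alpha_+ + \beta_+ - 1} \cdot \frac{|S^-| + \alpha_- + \beta_-}{\beta_-} \right)}.
$$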
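The Poisson-prior bound can be evaluated the same way. The sketch below assumes $g(l; \lambda, J) = (\lambda/2)^l \, \Gamma(J - l + 1)$, which is the form implied by steps (44) and (45), and takes $g_{\max}$ as the larger of its two endpoint values $l = 1$ and $l = J$, as stated in the proof. Function names and the example parameters are illustrative assumptions, not values from the paper.

```python
from math import lgamma, log

def log_g(l, lam, J):
    """log g(l; lam, J) with g(l; lam, J) = (lam / 2)**l * Gamma(J - l + 1)."""
    return l * log(lam / 2) + lgamma(J - l + 1)

def log_g_max(lam, J):
    """log of g_max: the larger of the endpoint values l = 1 and l = J."""
    return max(log_g(1, lam, J), log_g(J, lam, J))

def support_threshold_poisson(lam_M, lam_L, J,
                              n_pos, n_neg, a_pos, b_pos, a_neg, b_neg):
    """Right-hand side of the support bound derived for the Poisson prior."""
    # log of Gamma(J + 1) / (lam_M * exp(-lam_L) * g_max(lam_L, J)).
    log_prior_factor = lgamma(J + 1) - (log(lam_M) - lam_L + log_g_max(lam_L, J))
    # log of the reciprocal of the likelihood ratio from (42).
    log_inv_ratio = (log((n_pos + a_pos - 1) / (n_pos + a_pos + b_pos - 1))
                     + log((n_neg + a_neg + b_neg) / b_neg))
    if log_inv_ratio <= 0:
        raise ValueError("the likelihood ratio must be below 1 for this bound")
    return log_prior_factor / log_inv_ratio

# Illustrative values only: 10 attributes, lambda_M = 5, lambda_L = 3.
threshold = support_threshold_poisson(5.0, 3.0, 10, 500, 500, 100, 1, 1, 1000)
print(round(threshold, 2))
```

Because $\log g$ is the sum of a linear term and a log-Gamma term, it is convex in $l$, which is why checking only the two endpoints is enough in this sketch.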

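Proof. (Of Theorem 5) Consider the empirical risk on data $S$ given $\rho_+, \rho_-$:

\begin{align}
\frac{1}{N} \sum_{n=1}^{N} \mathbb{1}_{y_n \ne f_A(x_n)}
&= \frac{1}{N} \left( \sum_{f_A(x_n)=1,\, y_n=1} 0 + \sum_{f_A(x_n)=0,\, y_n=0} 0 + \sum_{f_A(x_n)=1,\, y_n=0} 1 + \sum_{f_A(x_n)=0,\, y_n=1} 1 \right) \tag{47} \\
&\le \frac{1}{N} \left( \sum_{f_A(x_n)=1,\, y_n=1} \frac{\log \rho_+}{\log \frac{1}{2}} + \sum_{f_A(x_n)=0,\, y_n=0} \frac{\log \rho_-}{\log \frac{1}{2}} + \sum_{f_A(x_n)=1,\, y_n=0} \frac{\log(1 - \rho_+)}{\log \frac{1}{2}} + \sum_{f_A(x_n)=0,\, y_n=1} \frac{\log(1 - \rho_-)}{\log \frac{1}{2}} \right) \tag{48} \\
&= \frac{\log P(S \mid A; \rho_+, \rho_-)}{N \log \frac{1}{2}}. \tag{49}
\end{align}

Since $\rho_+, \rho_- > \frac{1}{2}$, (48) follows from

$$
0 < \frac{\log \rho_+}{\log \frac{1}{2}}, \ \frac{\log \rho_-}{\log \frac{1}{2}} < 1
\qquad \text{and} \qquad
\frac{\log(1 - \rho_+)}{\log \frac{1}{2}}, \ \frac{\log(1 - \rho_-)}{\log \frac{1}{2}} > 1.
$$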
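The step from (47) to (49) can be illustrated with a small numerical check. This sketch assumes the likelihood factorizes as $\rho_+^{\mathrm{TP}} (1 - \rho_+)^{\mathrm{FP}} \rho_-^{\mathrm{TN}} (1 - \rho_-)^{\mathrm{FN}}$, which is what the four sums in (48) add up to; the function names and the confusion-matrix counts are made up for illustration.

```python
from math import log

def empirical_risk(tp, fp, tn, fn):
    """Fraction of misclassified observations, as in (47)."""
    return (fp + fn) / (tp + fp + tn + fn)

def likelihood_bound(tp, fp, tn, fn, rho_pos, rho_neg):
    """log P(S | A; rho_+, rho_-) / (N log(1/2)), as in (49).

    Assumes rho_pos and rho_neg both lie strictly between 1/2 and 1.
    """
    if not (0.5 < rho_pos < 1 and 0.5 < rho_neg < 1):
        raise ValueError("rho_pos and rho_neg must lie in (1/2, 1)")
    n = tp + fp + tn + fn
    log_lik = (tp * log(rho_pos) + fp * log(1 - rho_pos)
               + tn * log(rho_neg) + fn * log(1 - rho_neg))
    return log_lik / (n * log(0.5))

risk = empirical_risk(40, 5, 45, 10)
bound = likelihood_bound(40, 5, 45, 10, rho_pos=0.9, rho_neg=0.8)
print(risk <= bound)
```

Each correctly classified point contributes a value in (0, 1) rather than 0, and each error contributes more than 1 rather than exactly 1, so the bound is never below the empirical risk under these assumptions.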
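Using Hoeffding's inequality and the union bound, we get that, with probability $1 - \delta$, for all $A \in \Lambda$, the true risk $R^{\mathrm{true}}(f_A)$ and the empirical risk $R^{\mathrm{emp}}(f_A)$ satisfy

$$
R^{\mathrm{true}}(f_A) \le R^{\mathrm{emp}}(f_A) + \sqrt{\frac{\log |\Lambda| + \log \frac{1}{\delta}}{2N}},
\tag{50}
$$

where $\Lambda$ denotes the class of BOA models. $|\Lambda|$ can be computed by counting the number of pattern sets of different sizes up to $M_{\mathrm{upper}}$, which is given in Theorem 1 and Theorem 2 for the two BOA models.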


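$$
|\Lambda| = \sum_{m=1}^{M_{\mathrm{upper}}} \binom{|\mathcal{A}_S|}{m}.
$$

$\mathcal{A}_S$ contains all patterns, so

$$
|\mathcal{A}_S| = \prod_{j=1}^{J} (K_j + 1).
$$

Therefore

$$
|\Lambda| \le \sum_{m=1}^{M_{\mathrm{upper}}} \binom{\prod_{j=1}^{J} (K_j + 1)}{m}.
\tag{51}
$$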


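Combining (49), (50) and (51), we have, for the true risk $R^{\mathrm{true}}(f_A)$,

$$
R^{\mathrm{true}}(f_A) \le \frac{\log P(S \mid A; \rho_+, \rho_-)}{N \log \frac{1}{2}} + \sqrt{\frac{\log \sum_{m=1}^{M_{\mathrm{upper}}} \binom{\prod_{j=1}^{J} (K_j + 1)}{m} + \log \frac{1}{\delta}}{2N}}.
\tag{52}
$$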
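The complexity term of the bound can be computed directly from the attribute cardinalities. The sketch below assumes the standard Hoeffding-plus-union-bound form, in which the number of candidate pattern sets is the sum of binomial coefficients over set sizes up to $M_{\mathrm{upper}}$ and the pattern pool has $\prod_j (K_j + 1)$ elements; the function names and example values are hypothetical.

```python
from math import comb, log, prod, sqrt

def num_patterns(values_per_attribute):
    """Size of the pattern pool: each attribute is absent or takes one of K_j values."""
    return prod(k + 1 for k in values_per_attribute)

def complexity_term(values_per_attribute, m_upper, n, delta):
    """Square-root term of the generalization bound for n observations."""
    pool = num_patterns(values_per_attribute)
    # Count pattern sets of every size from 1 to m_upper.
    n_models = sum(comb(pool, m) for m in range(1, m_upper + 1))
    return sqrt((log(n_models) + log(1 / delta)) / (2 * n))

# Two attributes with 2 and 3 values: 12 patterns, 12 + 66 = 78 sets of size <= 2.
term = complexity_term([2, 3], m_upper=2, n=1000, delta=0.05)
print(round(term, 4))
```

Python integers are arbitrary precision, so the count of pattern sets stays exact even for large pools, and `math.log` accepts such integers directly.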


B Mobile Advertisement Datasets 

The attributes of this data set include: 

1. User attributes 

• Gender: male, female 

• Age: below 21, 21 to 25, 26 to 30, etc. 

• Marital Status: single, married partner, unmarried partner, or widowed 

• Number of children: 0, 1, or more than 1 

• Education: high school, bachelor's degree, associate's degree, or graduate degree 

• Occupation: architecture & engineering, business & financial, etc. 

• Annual income: less than $12500, $12500 - $24999, $25000 - $37499, etc. 

• Number of times that he/she goes to a bar: 0, less than 1, 1-3, 4-8 or greater than 8 

• Number of times that he/she buys takeaway food: 0, less than 1, 1-3, 4-8 or greater than 8 
• Number of times that he/she goes to a coffee house: 0, less than 1, 1-3, 4-8 or greater than 8 

• Number of times that he/she eats at a restaurant with average expense less than $20 per person: 
0, less than 1, 1-3, 4-8 or greater than 8 

2. Contextual attributes 

• Driving destination: home, work, or no urgent destination 

• Location of user, coupon and destination: we provide a map to show the geographical location 
of the user, destination, and the venue, and we mark the distance between each two places with 
time of driving. The user can see whether the venue is in the same direction as the destination. 

• Weather: sunny, rainy, or snowy 

• Temperature: 30°F, 55°F, or 80°F 

• Time: 10AM, 2PM, or 6PM 

• Passenger: alone, partner, kid(s), or friend(s) 

3. Coupon attributes 

• Time before it expires: 2 hours or one day 

All coupons provide a 20% discount. The survey was divided into different parts, so that Turkers without 
children would never see a scenario where their “kids” were in the vehicle. Figure 5 shows an example 
scenario from the survey. 


[Figure 5, survey screenshot. Scenario shown: Destination: Home; Driving alone; Snowy, 30°F; Time: 10 PM. Question: "Will you get and use the coupon:" with the options "Yes, and I'll consider driving there right away", "Yes, and I'll consider driving there later before the coupon expires", and "No, I do not want the coupon". Rating prompt: "Rate this coupon from 1 (not interested at all) to 5 (very interested)".] 

Figure 5: An example of a scenario in the survey 
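A survey scenario like the one in Figure 5 maps naturally onto attribute-value pairs, which is the input form an or's-of-and's classifier consumes. The following is a minimal sketch: the attribute keys are illustrative spellings of the attributes listed above, and the pattern set is a made-up example, not a model learned in the paper.

```python
# One survey scenario as attribute-value pairs (keys are illustrative).
scenario = {
    "destination": "home",
    "passenger": "alone",
    "weather": "snowy",
    "temperature": "30F",
    "time": "10PM",
}

# Hypothetical pattern set: each dict is a conjunction of literals,
# and the list as a whole is their disjunction.
pattern_set = [
    {"passenger": "friend(s)", "weather": "sunny"},
    {"destination": "no urgent destination"},
]

def satisfies(record, pattern):
    """True if the record matches every literal in the pattern."""
    return all(record.get(attr) == value for attr, value in pattern.items())

def predict(record, patterns):
    """Predict 1 if the record satisfies at least one pattern, else 0."""
    return int(any(satisfies(record, p) for p in patterns))

print(predict(scenario, pattern_set))
```

Changing the destination of the same scenario to "no urgent destination" makes the second pattern fire, so the prediction flips to 1.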